SPH 247 Statistical Analysis of Laboratory Data. April 28, 2015 SPH 247 Statistics for Laboratory Data 1

Size: px
Start display at page:

Download "SPH 247 Statistical Analysis of Laboratory Data. April 28, 2015 SPH 247 Statistics for Laboratory Data 1"

Transcription

1 SPH 247 Statistical Analysis of Laboratory Data April 28, 2015 SPH 247 Statistics for Laboratory Data 1

2 Outline RNA-Seq for differential expression analysis Statistical methods for RNA-Seq: Structure and choices Criteria for comparison of statistical methods Results of standard packages on three data set Null performance Source of differences in the results Input data Tests Variance estimation Normalization Conclusions April 28, 2015 SPH 247 Statistics for Laboratory Data 2

3 RNA-Seq Gene expression is the transcription of the DNA in a gene into mrna, which (in many cases) is later translated into a protein. We can measure expression of a single gene with PCR or other assays. Gene expression arrays measure expression of many genes simultaneously using spots each of which contains a matching sequence to the gene sequence to be detected. But this can only detect what we already suspected might be there. April 28, 2015 SPH 247 Statistics for Laboratory Data 3

4 RNA-Seq For RNA-Seq, the RNA in the sample is reverse transcribed into the corresponding DNA sequence. Then the DNA fragments are sequenced (in an NGS sequencer, usually Illumina ) Each fragment is mapped to the reference genome The data to be analyzed are the number of fragments mapping to each gene in a table where the columns are samples and the rows are genes. April 28, 2015 SPH 247 Statistics for Laboratory Data 4

5 RNA-Seq This mapping can be complex We can choose to estimate isoforms or not We can choose how to handle ambiguous reads (omit or spread across genes) We can then use statistical analysis to determine when there is significantly more expression in one condition or another. This may or may not be better than an expression array depending on goals. April 28, 2015 SPH 247 Statistics for Laboratory Data 5

6 Analysis of RNA-Seq Data For each gene/exon/isoform (we will say gene from now on), and for each sample, we have a count of fragments mapping to that gene. In principle, we need to test whether the counts from one group are significantly larger than another. Or we may have more than one factor or variable that could be associated. In practice, we may (probably) need to normalize the samples first, and may need to import some information across genes. April 28, 2015 SPH 247 Statistics for Laboratory Data 6

7 Read Counts We assume that bioinformatic analysis has been conducted on the original RNA-Seq data for each sample and converted into reads K ij per gene i per sample j. If the sample were a big soup of mapped fragments, and a sample were taken of it, then the statistical distribution of the K ij would be multinomial with a very high dimension, yielding a vector of length 20,000 or so with average counts per item varying over many orders of magnitude. April 28, 2015 SPH 247 Statistics for Laboratory Data 7

8 Distribution of Read Counts Generally, in a multinomial distribution, the counts per variable are correlated, because a count in one cell eliminates a count in another, but in this case weakly so. So approximately, the individual count values would be binomial, with a very small proportion p. This in turn is approximately Poisson, which might be a plausible starting place. April 28, 2015 SPH 247 Statistics for Laboratory Data 8

9 Poisson Problems The Poisson distribution has a single parameter λ, which controls both the mean λ and the variance λ. For a given gene i, the Poisson parameter λ i depends on the actual count of transcripts mapped to gene i in the soup, the total number of transcripts in the soup, and factors describing biases in production of reads from a transcript (including but not limited to the size of the gene). A hypothesis test of whether sample 1 differs in expression of gene i from sample 2, is a useless test two samples always differ. It is not a test of treatment vs. control. April 28, 2015 SPH 247 Statistics for Laboratory Data 9

10 Poisson Problems If X and Y are in dependent Poisson variates with parameters λ X and λ Y, then the hypothesis H 0 : λ X = λ Y can be tested by (for example) the statistic (X Y)/2L, where L = X + Y. This is because Var(X Y) = λ X + λ Y. But if X is from a treatment sample and Y from a control sample, this is in no way a reasonable test of whether treatment differs from control. It does not take biological variability into account at all. April 28, 2015 SPH 247 Statistics for Laboratory Data 10

11 Freezer Cells grown to 50% confluence in 8 dishes 4 treated with TNF-α 4 controls Other (biological) variability Actual transcript count for one gene in 8 dishes Poisson variability? RNA-Seq mapped read count for one gene in 8 dishes April 28, 2015 SPH 247 Statistics for Laboratory Data 11

12 Poisson Variability Is at best the technical variation Even then, the approximation is poor A better model for the overall experimental variability is perhaps a mixture of Poissons. Each sample under the same condition has a different actual transcript count before sampling, which means that λ ij for gene i and sample j differs from sample to sample. So the actual observed mapped reads count for a given gene across samples in the same condition can better be considered a mixture of Poisson random variables. April 28, 2015 SPH 247 Statistics for Laboratory Data 12

13 The Negative Binomial Given a series of Bernoulli trials (coin flips) with chance of success p on each trial, if we let X be the number of successes at the time when r th failure occurs, then X has a negative binomial distribution NB(r;p). For example, if r is 3, and the sequence is SSFSFSSSSFSSF, then X is 7 and the length of the sequence is 10. The mean of X is pr/(1-p) and the variance is pr/(1-p) 2. April 28, 2015 SPH 247 Statistics for Laboratory Data 13

14 The Negative Binomial An more useful formulation is that a negative binomial can be written as a gamma mixture of Poisson random variables. If an observed count is Poisson, with a random mean λ that differs from time to time and which is gamma distributed across samples, then the count overall is negative binomial. April 28, 2015 SPH 247 Statistics for Laboratory Data 14

15 Gamma Mixtures The gamma distribution (same as the chi-squared) is an asymmetric distribution defined on [0, ) with two parameters, the shape k and the scale θ. The mean is kθ and the variance is kθ 2. NB(r; p) is produced by a gamma mixture of Poissons with k = r and θ = p/(1-p). Otherwise put, if we can estimate the mean μ and variance σ 2, then the parameters of the negative binomial can be determined from those. April 28, 2015 SPH 247 Statistics for Laboratory Data 15

16 The negative binomial has two parameters: pr, arising from the original definition of NB k, θ from the gamma mixture µσ, 2 µα, mean and dispersion parameter 1/ α = r is sometimes called the size. 2 2 σ = µ + αµ April 28, 2015 SPH 247 Statistics for Laboratory Data 16

17 Generalized Linear Models The type of predictive model one uses depends on a number of issues; one is the type of response. Measured values such as quantity of a protein, age, weight usually can be handled in an ordinary linear regression model, possibly after a log transformation. Patient survival, which may be censored, calls for a different method (survival analysis, Cox regression). April 28, 2015 SPH 247 Statistics for Laboratory Data 17

18 If the response is binary, then can we use logistic regression models If the response is a count, we can use Poisson regression If the count has a higher variance than is consistent with the Poisson, we can use a negative binomial or quasipoisson Other forms of response can generate other types of generalized linear models One type of count data occurs in proteomics: number of unique peptide fragments mapping to the given protein. A similar count is the number of reads in RNA-Seq mapping to a particular gene. April 28, 2015 SPH 247 Statistics for Laboratory Data 18

19 Generalized Linear Models We need a linear predictor of the same form as in linear regression βx In theory, such a linear predictor can generate any type of number as a prediction, positive, negative, or zero We choose a suitable distribution for the type of data we are predicting (normal for any number, gamma for positive numbers, binomial for binary responses, Poisson for counts) We create a link function which maps the mean of the distribution onto the set of all possible linear prediction results, which is the whole real line (-, ). The inverse of the link function takes the linear predictor to the actual prediction April 28, 2015 SPH 247 Statistics for Laboratory Data 19

20 Ordinary linear regression has identity link (no transformation by the link function) and uses the normal distribution If one is predicting an inherently positive quantity, one may want to use the log link since e x is always positive. An alternative to using a generalized linear model with an log link, is to transform the data using the log or maybe glog. This is a device that works well with measurement data and may be usable in other cases April 28, 2015 SPH 247 Statistics for Laboratory Data 20

21 R glm() Families Family gaussian binomial Gamma inverse.gaussian poisson negative binomial quasi quasibinomial quasipoisson Links identity, log, inverse logit, probit, cauchit, log, cloglog inverse, identity, log 1/mu^2, inverse, identity, log log, identity, sqrt log, identity, sqrt (uses glm.nb()) identity, logit, probit, cloglog, inverse, log, 1/mu^2 and sqrt logit, probit, identity, cloglog, inverse, log, 1/mu^2 and sqrt log, identity, logit, probit, cloglog, inverse, 1/mu^2 and sqrt April 28, 2015 SPH 247 Statistics for Laboratory Data 21

22 R glm() Link Functions Links Domain Range identity (, ) (, ) log (0, ) (, ) inverse (0, ) (0, ) logit (0, 1) (, ) probit (0, 1) (, ) cloglog (0, 1) (, ) 1/mu^2 (0, ) (0, ) sqrt (0, ) (0, ) η = Xβ = g( µ ) = µ η = Xβ = g( µ ) = log( µ ) η = Xβ = g( µ ) = 1/ µ η = Xβ = g µ = Φ p 1 ( ) ( ) η = Xβ = g( µ ) = log( log(1 p)) 2 η = Xβ = g( µ ) = 1/ µ η = Xβ = g( µ ) = µ ( ) η = X β = g( µ ) = log p/ (1 p) April 28, 2015 SPH 247 Statistics for Laboratory Data 22

23 Possible Means 0 Link = Log - 0 Predictors April 28, 2015 SPH 247 Statistics for Laboratory Data 23

24 Possible Means 0 Inverse Link = e x - 0 Predictors April 28, 2015 SPH 247 Statistics for Laboratory Data 24

25 Logistic Regression: response is binary, coded as 0/1, with mean x p e ln = xβ px ( ) = 1 p 1 e n k n k f( k) = p (1 p) k E( k) = np V( k) = τnp(1 p) where τ = 1 if not overdispersed β xβ p Poisson Regression: response is a count, with mean λ ln( λ) = xβ λ = e k λ e f( k) = k! Ek ( ) = λ λ xβ V ( k) = τλ where τ = 1 if not overdispersed April 28, 2015 SPH 247 Statistics for Laboratory Data 25

26 Poisson Regression: response is a count, with mean λ ln( λ) = xβ λ = e k λ e f ( k) = k! Ek ( ) = λ λ xβ V( k) = τλ where τ = 1 if not overdispersed Negative Binomial Regression: response is a count, with mean λ ln( λ) = xβ λ = e x+ r 1 f( k) = p x rp Ek ( ) = λ = 1 p V k x xβ (1 p) n x = λ+ c λ = λ+ r λ τ = ( ) where 1 if not overdispersd e April 28, 2015 SPH 247 Statistics for Laboratory Data 26

27 Logistic Regression Suppose we are trying to predict a binary variable (patient has ovarian cancer or not, patient is responding to therapy or not) We can describe this by a 0/1 variable in which the value 1 is used for one response (patient has ovarian cancer) and 0 for the other (patient does not have ovarian cancer We can then try to predict this response April 28, 2015 SPH 247 Statistics for Laboratory Data 27

28 For a given patient, a prediction can be thought of as a kind of probability that the patient does have ovarian cancer. As such, the prediction should be between 0 and 1. Thus ordinary linear regression is not suitable The logit transform takes a number which can be anything, positive or negative, and produces a number between 0 and 1. Thus the logit link is useful for binary data April 28, 2015 SPH 247 Statistics for Laboratory Data 28

29 Possible Means 0 1 Link = Logit - 0 Predictors April 28, 2015 SPH 247 Statistics for Laboratory Data 29

30 Possible Means 0 1 Inverse Link = inverse logit - 0 Predictors April 28, 2015 SPH 247 Statistics for Laboratory Data 30

31 p logit ( p) = log if p 0 then logit( p) if p 1 then logit( p) 1 p e 1+ e x logit ( x) = if x then logit ( x) 0 if x then logit ( x) 1 x x x x e e e x x x log 1+ e log 1 e x = + = 1 x x 1 x log + e = log ( e ) = x e + e e x 1 x x + e + e 1+ e April 28, 2015 SPH 247 Statistics for Laboratory Data 31

32 April 28, 2015 SPH 247 Statistics for Laboratory Data 32

33 April 28, 2015 SPH 247 Statistics for Laboratory Data 33

34 Overdispersion A binomial random variable X with n observations and probability of success p has mean np and variance np(1 p). Could the variance be systematically larger? It could if p varies from trial to trial. The worst case is if X = n with probability p and X = 0 with probability 1 p. Then the mean is still np but the variance is now n 2 p(1 p) >> np(1 p). If we have overdispersion, we can use the quasibinomial family, which estimates the variance rather than assumes that it is np(1 p). April 28, 2015 SPH 247 Statistics for Laboratory Data 34

35 Poisson Distributions The Poisson distribution can be used to model unbounded count data, 0, 1, 2, 3, An example would be the number of cases of sepsis in each hospital in a city in a given month. The Poisson distribution has a single parameter λ, which is the mean of the distribution and also the variance. The standard deviation is λ April 28, 2015 SPH 247 Statistics for Laboratory Data 35

36 Poisson Regression If the mean λ of the Poisson distribution depends on variables x 1, x 2,, x p then we can use a generalized linear model with Poisson distribution and log link. We have that log(λ) is a linear function of x 1, x 2,, x p. This works pretty much like logistic regression, and is used for data in which the count has no specific upper limit (number of cases of lung cancer at a hospital) whereas logistic regression would be used when the count is the number out of a total (number of emergency room admissions positive for C. dificile out of the known total of admissions). With overdispersion, we can use the quasipoisson family or the negative binomial. April 28, 2015 SPH 247 Statistics for Laboratory Data 36

37 Existing Methods for RNA-Seq Existing methods often contain complex combinations or filtering, normalization, transformation, and variance estimation before any statistical tests are performed. The methods are often poorly documented and change rapidly and often substantially between frequent versions, without any push notice. The results of different methods such as DESeq, edger, and voom/limma can differ substantially. It appears that most methods produce large numbers of false positives. April 28, 2015 SPH 247 Statistics for Laboratory Data 37

38 Bottomly Data Two inbred mouse strains, C7BL6/6J (10 animals) and DBA/2J (11 animals) Gene expression in striatal neurons by RNA-Seq and expression arrays. 36,536 unique genes of which 11,870 had a count of at least 10 across the 21 samples. This data set has been used in other comparisons We compare three packages on this data set. Later we construct data sets where the null hypothesis is true. We see what some simple methods can do. April 28, 2015 SPH 247 Statistics for Laboratory Data 38

39 Statistical Tests Example gene from Bottomly data ENSMUSG C57BL/6J 2, 7, 9, 11, 3, 7, 1, 1, 2, 1 Median 2.5, Mean 4.4, sd 3.75, var DBA/2J 8, 6, 12, 10, 19, 15, 17, 6, 20, 6, 14 Median 12, Mean 12.09, sd 5.28, var April 28, 2015 SPH 247 Statistics for Laboratory Data 39

40 April 28, 2015 SPH 247 Statistics for Laboratory Data 40

41 Poisson Analysis One of the first ideas was that the counts might be Poisson distributed and that we could use this for statistical tests. In this case, if the null hypothesis was true, then the 21 observations came from the same Poisson distribution with observed mean and with variance also estimated at (Poisson distributions have mean and variance that are equal). This implies that the variance of the mean of the 10 C7BL/6J counts should be 8.429/10 and the variance of the mean of the 11 DBA/2J counts should be 8.429/11 So we can use z = ( )/ (8.429/ /11) = 6.06 to test the difference April 28, 2015 SPH 247 Statistics for Laboratory Data 41

42 Poisson Analysis This turns out to be a terrible idea. Frequently, the variance of the data is much larger than the mean. We call this overdispersion. C57BL/6J Mean 4.4, Variance DBA/2J Mean 12.09, Variance Also, Poisson analysis can be done with one treatment and one control, and any belief that statistical evidence can be gathered with no biological replicates is evidence of terminal delusion! April 28, 2015 SPH 247 Statistics for Laboratory Data 42

43 Over-dispersed Poisson Analysis In this case, we estimate the means and the variances from the data using either the negative binomial distribution or simply an over-dispersed Poisson (quasi-poisson). These may yield similar results, but need not. In the negative binomial case, we model the data so that, for a given gene g and /sample i, the count y has E( y ) ig Var( y 2 2 ig ig ig For the quasi-poisson, Va r( y ig = µ ) ) ig = µ + c µ = θµ 2 ig ig April 28, 2015 SPH 247 Statistics for Laboratory Data 43

44 > summary(glm(y1~strain,family=poisson)) Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) < 2e-16 *** straindba/2j e-09 *** (Dispersion parameter for poisson family taken to be 1) > anova(glm(y1~strain,family=poisson),test="chisq") Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL strain e-10 *** First test is Wald test, second is likelihood ratio test. But Poisson assumption is not valid. April 28, 2015 SPH 247 Statistics for Laboratory Data 44

45 > summary(glm(y1~strain,family=quasipoisson)) Coefficients: Estimate Std. Error t value Pr(> t ) (Intercept) e-06 *** straindba/2j ** (Dispersion parameter for quasipoisson family taken to be ) > anova(glm(y1~strain,family=quasipoisson),test="chisq") Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL strain *** April 28, 2015 SPH 247 Statistics for Laboratory Data 45

46 > summary(glm.nb(y1~strain)) Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) e-13 *** straindba/2j e-05 *** (Dispersion parameter for Negative Binomial(5.1558) family taken to be 1) > anova(glm.nb(y1~strain),test="chisq") Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL strain e-05 *** April 28, 2015 SPH 247 Statistics for Laboratory Data 46

47 P-values ( 10 5 ) Wald LR Quasi-Poisson Negative Binomial Both the Wald test and the likelihood ratio test are valid asymptotically. The quasi-poisson and negative binomial should be similar asymptotically, depending on the variances In finite samples there can be substantial differences April 28, 2015 SPH 247 Statistics for Laboratory Data 47

48 > t.test(y1~strain) Welch Two Sample t-test t = , df = , p-value = alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: sample estimates: mean in group C57BL/6J mean in group DBA/2J Similar results to that using the negative binomial or quasi-poisson. No assumption about relationship of variance to mean. Fits separate variances to the two groups. April 28, 2015 SPH 247 Statistics for Laboratory Data 48

49 Is the Test Statistic Important? Not too much if the Negative Binomial is True Simulated negative binomial count data Two groups of 10 Mean 5 and variance 6 10,000 trials Test Reject at 5% Reject at 1% Reject at 0.1% t-test 4.9% 0.90% 0.11% Wilcoxon Test 4.6% 0.74% 0.03% NB Likelihood Ratio Test 5.2% 0.95% 0.13% NB Wald Test 5.7% 1.16% 0.22% 4.8% 5.2% 0.90% 1.11% 0.07% 0.12% April 28, 2015 SPH 247 Statistics for Laboratory Data 49

50 825 Genes Significant by at least one method 486 by all methods More significant genes by edger Is this greater power or is the test producing false positives? April 28, 2015 SPH 247 Statistics for Laboratory Data 50

51 Null Behavior 100 randomly selected subsets of size 6 out of the 10 C7BL6 mice and 100 randomly selected subsets of size 6 from the DBA/2J In each subset, 3 assigned to treatment and 3 to control at random. The null hypothesis is on the average true. Given the 11,870 genes with large enough counts, we would expect about 0.01% to be significant at the p = level. We have 2 1,187,000 tests, so there should be about 237 rejections for each method April 28, 2015 SPH 247 Statistics for Laboratory Data 51

52 Significant at p = Null Hypothesis is True Estimator Observed Expected Over-Rejection DESeq DESeq edger edger-robust limma-voom April 28, 2015 SPH 247 Statistics for Laboratory Data 52

53 Important Factors These methods share some features but differ in their implementation. Most of them treat the data as over-dispersed Poisson, or specifically negative binomial They may use a test based on the negative binomial distribution, but limma-voom uses standard regression/anova with variance-based weights Since sample sizes tend to be small, they all have the possibility to replace the variance estimate from a given gene with a smoothed estimate based on all the genes. Normalization may be needed due to the differing total numbers of reads April 28, 2015 SPH 247 Statistics for Laboratory Data 53

54 Factors That May Cause False Positives Normalization procedure Normalization driven by high-abundance transcripts can cause spurious significance of low abundance ones. Variance estimation Many methods borrow strength from other genes to get better variance estimates. This may cause bias Statistical test used Many choices Nature of the data May require filtering April 28, 2015 SPH 247 Statistics for Laboratory Data 54

55 Normalization Normalization is commonly used in differential expression analysis with microarrays. Normalization for RNA-Seq is often couched in terms like library size as if we should divide by the total (mapped) fragment count This can be problematic because it depends on only a few genes. Mostly, normalization in RNA-Seq is done with a single constant per sample, though this is unusual with expression arrays April 28, 2015 SPH 247 Statistics for Laboratory Data 55

56 Library Size Normalization Using the total fragment count is problematic because highly expressed genes will provide most of the fragments Four genes with expression 10,000, 100, 150, 200 in condition A and 20,000, 100, 150, 200 in condition B. Normalized fragment counts use total fragment counts of 10,450 and 20,450 and can be normalized to 15,450 Normalized fragment counts are 14,785, 148, 222, 296 in condition A and 15,110, 76, 113, 151 in condition B, so upregulation of gene 1 has been turned into down-regulation of the other three. Fold changes are 2.0, 1.0, 1.0, 1.0 before normalization and 1.02, 0.51, 0.51, 0.51 after. April 28, 2015 SPH 247 Statistics for Laboratory Data 56

57 Normalization Methods Total count Quantile Normalization of other signal based methods Geometric normalization (Cuffdiff2/DESeq version) For each gene, compute the geometric mean of the total fragment count across libraries Library size is the median across genes of the total fragment count divided by the geometric mean fragment count. In our 4-gene example, the geometric means are 14,142, 100, 150, 200, the ratios for A are 0.707, 1, 1, 1 and for B are 1.414, 1, 1, 1, so the size factors are the medians, namely 1 and 1 Median normalization: divide each count by the median count of that sample (and multiply by the mean of the medians) April 28, 2015 SPH 247 Statistics for Laboratory Data 57

58 k fragment count for gene i in library j ij m 1/ m m 1 gi = ki ν = ki ν ν = 1 m ν = 1 s j exp log( ) kij = median i g i Cuffdiff2 first normalizes replicates under the same conditions giving an internal library size of s j Then the arithmetic mean of the scaled gene counts for each gene is used to compute an external library size of η j. This is a possible source of problems, the scale of which is unknown April 28, 2015 SPH 247 Statistics for Laboratory Data 58

59 Variance Estimation Many RNA-Seq experiments are small. Small studies have low power Very small p-values are needed to pass the false discovery filter So the default for small studies is no results Suppose two means differ by one standard deviation With 10,000 genes, the Bonferroni level is With two groups of 3, there are ~4df and corresponds to almost 50 standard deviations April 28, 2015 SPH 247 Statistics for Laboratory Data 59

60 Multiplicity Problems Suppose we have 10,000 genes and can run a test on each one. If we use a criterion of p = 0.05, then we will have 500 false positives. We can help this with a lower value of p, such as (~10 fp) If we pick p = 0.05 /10,000 then fp = 0 with probability this is too strict Mostly, we use false discovery rate control which maps the vector of p-values to a vector of adjusted p-values. On the average, the set of tests with FDR adjusted p-values below 10% contains at most 10% false discoveries. But this still requires very small p-values to be significant. These can be very hard to achieve with small sample sizes. April 28, 2015 SPH 247 Statistics for Laboratory Data 60

61 Variance Estimation If we use tests that reference only the data from the specific gene, then usually variance estimation is not a problem. But with small sample sizes, the power is low, so there is a temptation to improve the variance estimates by smoothing or shrinkage. The variance of a negative binomial increases with the mean in a way controlled by a variance parameter. We can smooth the plot of the sample variance vs. the mean and use the smoothed estimate instead of the per-gene estimate. April 28, 2015 SPH 247 Statistics for Laboratory Data 61

62 A Poisson random variable with parameter λ has µ = λ σ 2 = λ A negative binomial random variable can be seen as a mixture of Poissons with varying λ µ = λ σ = λ+ c λ σ / µ = λ + c c s x cˆ = if this is positive 2 x If we smooth a plot of the variance or the square CV or cˆ 2 vs. the mean we obtain an average estimate of c for the given value of µ We can use the individual variance, the smoothed variance, or a compromise April 28, 2015 SPH 247 Statistics for Laboratory Data 62

63 April 28, 2015 SPH 247 Statistics for Laboratory Data 63

64 April 28, 2015 SPH 247 Statistics for Laboratory Data 64

65 Variance Estimation The sample variance is an unbiased estimate of the population variance A smoothed variance will be biased down or up depending on the data point While this can reduce the MSE of estimating the variance, it may increase false positives and false negatives for tests based on those variance estimates This is a possible source of the differences in results in various methods of analysis. April 28, 2015 SPH 247 Statistics for Laboratory Data 65

66 Empirical Bayes The original inspiration for the idea of variance shrinkage is from a result of Charles Stein and implementation in the form of empirical Bayes analysis. Empirical Bayes means that the prior, instead of being given in advance, is estimated also from the data. This can sometimes be an effective method for reducing the mean square error of estimation. We go through this in the case of measured response (as in arrays) not counted response, where the math is only approximate. April 28, 2015 SPH 247 Statistics for Laboratory Data 66

67 Suppose we have n observations from a normal distribution 2 ~ (, ),1 x i N µσ i n then x is the 'best' estimator of µ in a variety of ways: Minimum Variance Unbiased Estimator Minimum Mean Square Erro r Estimator Maximum Likelihood Estimator What if we have multiple distributions? x N µ σ i n,1 j m 2 ij ~ ( j, j ),1 j The estimates x j for µ are still MVUE and MLE, but not MMSQEE. j This idea is due to Charles Stein and later Willard James. Suppose that for simplicity we assume that all the variances are known and that µ θτ 2 j ~ N(, ) f( µ j, xj) f( xj µ j) f( µ j) f( µ j xj) = = which is itself a normal distribution with mean f( x ) f( x ) j kx k k n j j + (1 ) θ where = τ / ( τ + σ j / j) a weighted average of the aggregate mean and the individual mean 2 but this is only helpful if we know or can estimate θ and τ April 28, 2015 SPH 247 Statistics for Laboratory Data 67

68 µ θτ x x j 2 ~ N(, ) 2 j µ j ~ N( µ j, σ ) j is normally distributed E( x) = E [ E ( x µ )] = E [ µ ] = θ µ x µ µ 2 2 V( x) = V( x µ ) + V( µ ) = σ / n+ τ ˆ θ = x ˆ τ = Vˆ ( x) se ( x) 2 2 pooled or use maximum likelihood A similar setup works for estimation of variances. We have the individual variance estimates and a hypothesized prior distribution for the true variances. The conjugate prior is the inverse gamma. We use the posterior estimate of the individual variance, and estimate the parameters of the prior. Empirical Bayes. April 28, 2015 SPH 247 Statistics for Laboratory Data 68

69 We have p measurements (genes) and n arrays and a linear model to apply to each gene. The tests (F- or T- ) use the MSE or its square root as the denominator. If we have only a small df for each gene, then we may want to 'borrow strength' The true residual variance (MSE) for gene i is σ The observed residual variance (MSE) for gene i is s ~ σ χ ( ν) / ν where ν is the residual degrees of freedom i i 2 This has a gamma distribution with mean τ = σ and shape a = ν / 2. If the τ = σ i 2 i i 2 i i are random and distributed from an inverse gamma distribution (reciprocal of a gamma) then this has param eters α and β or α and η = αβ ( η is the mean of the reciprocal of τ, which is sometimes called the precision). The density is f ( ταη ;, ) τ e = τ e α 1 α / ητ α 1 1/ τβ April 28, 2015 SPH 247 Statistics for Laboratory Data 69

70 2 The posterior distribution of τ given s i is proporti e τ 1/ τβ * where 2 = which, like the prior, is inverse gamma with parameters xν + 2 α / η ν/2+ α+ 1 * β * α ν α * β η = /2+ 2 = xν + 2 α / η = αβ = * * * ν + 2α xν + 2 α / η x = onal to 1 xν + 2 α / η 2 ν 1 2α = = s * i + η ν + 2α ν + 2α η ν + 2α So our estimate of the MSE is a weighted average of the observed MSE and the prior (reciprocal mean precision) with weights equal to the degrees of freedom April 28, 2015 SPH 247 Statistics for Laboratory Data 70

71 The observed gene-specific variance s distribution with mean τ and a = ν / 2, and if the prior for τ is inverse i 2 i has a gamma (chi-squared) gamma with parameters α and β, then it can be shown that the overall distribution of the observed gene-specific variances is 2 αβ si = F ν,2 α Given the collection of gene-specific variances, we can estimate the parameters α and β of the prior by maximum likelihood or by comparing moments. If the residual degrees of freedom is large, then it is probably best to use the gene-specific p-values. If it is small, there can be a substantial gain in power using the posterior p-values (p-values from the F- or t- test using the posterior MSE). April 28, 2015 SPH 247 Statistics for Laboratory Data 71

72 Lowess Normalization If we have vectors x 1, x 2,, x n each of length k, we can mean normalize them by computing the sample means m 1, m 2,, m n of each, and the grand mean m, and then replacing x i by x i m i +m. Intensity based normalization does this separately for small, medium, large genes, for possibly many bands, or in a running average. Lowess normalization orders the genes by mean expression and does a lowess smooth of each gene, subtracts that, and adds the mean of the lowess smooths. April 28, 2015 SPH 247 Statistics for Laboratory Data 72

73 April 28, 2015 SPH 247 Statistics for Laboratory Data 73

74 Conclusions Analysis of RNA-Seq data is still in the early stages of development. Existing programs vary substantially in the results and it is unclear which is the most reliable More work is needed that focuses on the fundamental components of analysis, particularly choice of test statistics, variance estimation, and sample normalization. April 28, 2015 SPH 247 Statistics for Laboratory Data 74

David M. Rocke Division of Biostatistics and Department of Biomedical Engineering University of California, Davis

David M. Rocke Division of Biostatistics and Department of Biomedical Engineering University of California, Davis David M. Rocke Division of Biostatistics and Department of Biomedical Engineering University of California, Davis March 18, 2016 UVA Seminar RNA Seq 1 RNA Seq Gene expression is the transcription of the

More information

Generalized linear models

Generalized linear models Generalized linear models Douglas Bates November 01, 2010 Contents 1 Definition 1 2 Links 2 3 Estimating parameters 5 4 Example 6 5 Model building 8 6 Conclusions 8 7 Summary 9 1 Generalized Linear Models

More information

High-Throughput Sequencing Course

High-Throughput Sequencing Course High-Throughput Sequencing Course DESeq Model for RNA-Seq Biostatistics and Bioinformatics Summer 2017 Outline Review: Standard linear regression model (e.g., to model gene expression as function of an

More information

Modeling Overdispersion

Modeling Overdispersion James H. Steiger Department of Psychology and Human Development Vanderbilt University Regression Modeling, 2009 1 Introduction 2 Introduction In this lecture we discuss the problem of overdispersion in

More information

Generalized Linear Models

Generalized Linear Models York SPIDA John Fox Notes Generalized Linear Models Copyright 2010 by John Fox Generalized Linear Models 1 1. Topics I The structure of generalized linear models I Poisson and other generalized linear

More information

Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA

Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA Expression analysis for RNA-seq data Ewa Szczurek Instytut Informatyki Uniwersytet Warszawski 1/35 The problem

More information

Lecture 3: Mixture Models for Microbiome data. Lecture 3: Mixture Models for Microbiome data

Lecture 3: Mixture Models for Microbiome data. Lecture 3: Mixture Models for Microbiome data Lecture 3: Mixture Models for Microbiome data 1 Lecture 3: Mixture Models for Microbiome data Outline: - Mixture Models (Negative Binomial) - DESeq2 / Don t Rarefy. Ever. 2 Hypothesis Tests - reminder

More information

Linear, Generalized Linear, and Mixed-Effects Models in R. Linear and Generalized Linear Models in R Topics

Linear, Generalized Linear, and Mixed-Effects Models in R. Linear and Generalized Linear Models in R Topics Linear, Generalized Linear, and Mixed-Effects Models in R John Fox McMaster University ICPSR 2018 John Fox (McMaster University) Statistical Models in R ICPSR 2018 1 / 19 Linear and Generalized Linear

More information

Regression models. Generalized linear models in R. Normal regression models are not always appropriate. Generalized linear models. Examples.

Regression models. Generalized linear models in R. Normal regression models are not always appropriate. Generalized linear models. Examples. Regression models Generalized linear models in R Dr Peter K Dunn http://www.usq.edu.au Department of Mathematics and Computing University of Southern Queensland ASC, July 00 The usual linear regression

More information

Generalized Linear Models

Generalized Linear Models Generalized Linear Models 1/37 The Kelp Data FRONDS 0 20 40 60 20 40 60 80 100 HLD_DIAM FRONDS are a count variable, cannot be < 0 2/37 Nonlinear Fits! FRONDS 0 20 40 60 log NLS 20 40 60 80 100 HLD_DIAM

More information

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form: Outline for today What is a generalized linear model Linear predictors and link functions Example: fit a constant (the proportion) Analysis of deviance table Example: fit dose-response data using logistic

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Statistics for Differential Expression in Sequencing Studies. Naomi Altman

Statistics for Differential Expression in Sequencing Studies. Naomi Altman Statistics for Differential Expression in Sequencing Studies Naomi Altman naomi@stat.psu.edu Outline Preliminaries what you need to do before the DE analysis Stat Background what you need to know to understand

More information

Lecture: Mixture Models for Microbiome data

Lecture: Mixture Models for Microbiome data Lecture: Mixture Models for Microbiome data Lecture 3: Mixture Models for Microbiome data Outline: - - Sequencing thought experiment Mixture Models (tangent) - (esp. Negative Binomial) - Differential abundance

More information

SPRING 2007 EXAM C SOLUTIONS

SPRING 2007 EXAM C SOLUTIONS SPRING 007 EXAM C SOLUTIONS Question #1 The data are already shifted (have had the policy limit and the deductible of 50 applied). The two 350 payments are censored. Thus the likelihood function is L =

More information

Generalized linear models

Generalized linear models Generalized linear models Outline for today What is a generalized linear model Linear predictors and link functions Example: estimate a proportion Analysis of deviance Example: fit dose- response data

More information

1. Hypothesis testing through analysis of deviance. 3. Model & variable selection - stepwise aproaches

1. Hypothesis testing through analysis of deviance. 3. Model & variable selection - stepwise aproaches Sta 216, Lecture 4 Last Time: Logistic regression example, existence/uniqueness of MLEs Today s Class: 1. Hypothesis testing through analysis of deviance 2. Standard errors & confidence intervals 3. Model

More information

Mixtures of Negative Binomial distributions for modelling overdispersion in RNA-Seq data

Mixtures of Negative Binomial distributions for modelling overdispersion in RNA-Seq data Mixtures of Negative Binomial distributions for modelling overdispersion in RNA-Seq data Cinzia Viroli 1 joint with E. Bonafede 1, S. Robin 2 & F. Picard 3 1 Department of Statistical Sciences, University

More information

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/ Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/28.0018 Statistical Analysis in Ecology using R Linear Models/GLM Ing. Daniel Volařík, Ph.D. 13.

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models Generalized Linear Models - part II Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs.

More information

36-463/663: Multilevel & Hierarchical Models

36-463/663: Multilevel & Hierarchical Models 36-463/663: Multilevel & Hierarchical Models (P)review: in-class midterm Brian Junker 132E Baker Hall brian@stat.cmu.edu 1 In-class midterm Closed book, closed notes, closed electronics (otherwise I have

More information

Exam C Solutions Spring 2005

Exam C Solutions Spring 2005 Exam C Solutions Spring 005 Question # The CDF is F( x) = 4 ( + x) Observation (x) F(x) compare to: Maximum difference 0. 0.58 0, 0. 0.58 0.7 0.880 0., 0.4 0.680 0.9 0.93 0.4, 0.6 0.53. 0.949 0.6, 0.8

More information

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 Introduction to Generalized Univariate Models: Models for Binary Outcomes EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 EPSY 905: Intro to Generalized In This Lecture A short review

More information

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1 Parametric Modelling of Over-dispersed Count Data Part III / MMath (Applied Statistics) 1 Introduction Poisson regression is the de facto approach for handling count data What happens then when Poisson

More information

Nonlinear Models. What do you do when you don t have a line? What do you do when you don t have a line? A Quadratic Adventure

Nonlinear Models. What do you do when you don t have a line? What do you do when you don t have a line? A Quadratic Adventure What do you do when you don t have a line? Nonlinear Models Spores 0e+00 2e+06 4e+06 6e+06 8e+06 30 40 50 60 70 longevity What do you do when you don t have a line? A Quadratic Adventure 1. If nonlinear

More information

scrna-seq Differential expression analysis methods Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden October 2017

scrna-seq Differential expression analysis methods Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden October 2017 scrna-seq Differential expression analysis methods Olga Dethlefsen NBIS, National Bioinformatics Infrastructure Sweden October 2017 Olga (NBIS) scrna-seq de October 2017 1 / 34 Outline Introduction: what

More information

Introduction to General and Generalized Linear Models

Introduction to General and Generalized Linear Models Introduction to General and Generalized Linear Models Generalized Linear Models - part III Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs.

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

Statistical tests for differential expression in count data (1)

Statistical tests for differential expression in count data (1) Statistical tests for differential expression in count data (1) NBIC Advanced RNA-seq course 25-26 August 2011 Academic Medical Center, Amsterdam The analysis of a microarray experiment Pre-process image

More information

STAT 526 Spring Midterm 1. Wednesday February 2, 2011

STAT 526 Spring Midterm 1. Wednesday February 2, 2011 STAT 526 Spring 2011 Midterm 1 Wednesday February 2, 2011 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points

More information

Poisson Regression. Gelman & Hill Chapter 6. February 6, 2017

Poisson Regression. Gelman & Hill Chapter 6. February 6, 2017 Poisson Regression Gelman & Hill Chapter 6 February 6, 2017 Military Coups Background: Sub-Sahara Africa has experienced a high proportion of regime changes due to military takeover of governments for

More information

Sociology 6Z03 Review II

Sociology 6Z03 Review II Sociology 6Z03 Review II John Fox McMaster University Fall 2016 John Fox (McMaster University) Sociology 6Z03 Review II Fall 2016 1 / 35 Outline: Review II Probability Part I Sampling Distributions Probability

More information

DEXSeq paper discussion

DEXSeq paper discussion DEXSeq paper discussion L Collado-Torres December 10th, 2012 1 / 23 1 Background 2 DEXSeq paper 3 Results 2 / 23 Gene Expression 1 Background 1 Source: http://www.ncbi.nlm.nih.gov/projects/genome/probe/doc/applexpression.shtml

More information

Linear Regression. Data Model. β, σ 2. Process Model. ,V β. ,s 2. s 1. Parameter Model

Linear Regression. Data Model. β, σ 2. Process Model. ,V β. ,s 2. s 1. Parameter Model Regression: Part II Linear Regression y~n X, 2 X Y Data Model β, σ 2 Process Model Β 0,V β s 1,s 2 Parameter Model Assumptions of Linear Model Homoskedasticity No error in X variables Error in Y variables

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) T In 2 2 tables, statistical independence is equivalent to a population

More information

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples

ST3241 Categorical Data Analysis I Generalized Linear Models. Introduction and Some Examples ST3241 Categorical Data Analysis I Generalized Linear Models Introduction and Some Examples 1 Introduction We have discussed methods for analyzing associations in two-way and three-way tables. Now we will

More information

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F). STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis 1. Indicate whether each of the following is true (T) or false (F). (a) (b) (c) (d) (e) In 2 2 tables, statistical independence is equivalent

More information

Lecture 14: Introduction to Poisson Regression

Lecture 14: Introduction to Poisson Regression Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu 8 May 2007 1 / 52 Overview Modelling counts Contingency tables Poisson regression models 2 / 52 Modelling counts I Why

More information

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview Modelling counts I Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu Why count data? Number of traffic accidents per day Mortality counts in a given neighborhood, per week

More information

Overdispersion Workshop in generalized linear models Uppsala, June 11-12, Outline. Overdispersion

Overdispersion Workshop in generalized linear models Uppsala, June 11-12, Outline. Overdispersion Biostokastikum Overdispersion is not uncommon in practice. In fact, some would maintain that overdispersion is the norm in practice and nominal dispersion the exception McCullagh and Nelder (1989) Overdispersion

More information

Comparative analysis of RNA- Seq data with DESeq2

Comparative analysis of RNA- Seq data with DESeq2 Comparative analysis of RNA- Seq data with DESeq2 Simon Anders EMBL Heidelberg Two applications of RNA- Seq Discovery Eind new transcripts Eind transcript boundaries Eind splice junctions Comparison Given

More information

9 Generalized Linear Models

9 Generalized Linear Models 9 Generalized Linear Models The Generalized Linear Model (GLM) is a model which has been built to include a wide range of different models you already know, e.g. ANOVA and multiple linear regression models

More information

Bayesian Classification Methods

Bayesian Classification Methods Bayesian Classification Methods Suchit Mehrotra North Carolina State University smehrot@ncsu.edu October 24, 2014 Suchit Mehrotra (NCSU) Bayesian Classification October 24, 2014 1 / 33 How do you define

More information

Generalized Linear Models 1

Generalized Linear Models 1 Generalized Linear Models 1 STA 2101/442: Fall 2012 1 See last slide for copyright information. 1 / 24 Suggested Reading: Davison s Statistical models Exponential families of distributions Sec. 5.2 Chapter

More information

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models SCHOOL OF MATHEMATICS AND STATISTICS Linear and Generalised Linear Models Autumn Semester 2017 18 2 hours Attempt all the questions. The allocation of marks is shown in brackets. RESTRICTED OPEN BOOK EXAMINATION

More information

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010 1 Linear models Y = Xβ + ɛ with ɛ N (0, σ 2 e) or Y N (Xβ, σ 2 e) where the model matrix X contains the information on predictors and β includes all coefficients (intercept, slope(s) etc.). 1. Number of

More information

Logistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20

Logistic regression. 11 Nov Logistic regression (EPFL) Applied Statistics 11 Nov / 20 Logistic regression 11 Nov 2010 Logistic regression (EPFL) Applied Statistics 11 Nov 2010 1 / 20 Modeling overview Want to capture important features of the relationship between a (set of) variable(s)

More information

Linear Regression Models P8111

Linear Regression Models P8111 Linear Regression Models P8111 Lecture 25 Jeff Goldsmith April 26, 2016 1 of 37 Today s Lecture Logistic regression / GLMs Model framework Interpretation Estimation 2 of 37 Linear regression Course started

More information

STAT 4385 Topic 01: Introduction & Review

STAT 4385 Topic 01: Introduction & Review STAT 4385 Topic 01: Introduction & Review Xiaogang Su, Ph.D. Department of Mathematical Science University of Texas at El Paso xsu@utep.edu Spring, 2016 Outline Welcome What is Regression Analysis? Basics

More information

g A n(a, g) n(a, ḡ) = n(a) n(a, g) n(a) B n(b, g) n(a, ḡ) = n(b) n(b, g) n(b) g A,B A, B 2 RNA-seq (D) RNA mrna [3] RNA 2. 2 NGS 2 A, B NGS n(

g A n(a, g) n(a, ḡ) = n(a) n(a, g) n(a) B n(b, g) n(a, ḡ) = n(b) n(b, g) n(b) g A,B A, B 2 RNA-seq (D) RNA mrna [3] RNA 2. 2 NGS 2 A, B NGS n( ,a) RNA-seq RNA-seq Cuffdiff, edger, DESeq Sese Jun,a) Abstract: Frequently used biological experiment technique for observing comprehensive gene expression has been changed from microarray using cdna

More information

Poisson Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Poisson Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University Poisson Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Poisson Regression 1 / 49 Poisson Regression 1 Introduction

More information

BMI 541/699 Lecture 22

BMI 541/699 Lecture 22 BMI 541/699 Lecture 22 Where we are: 1. Introduction and Experimental Design 2. Exploratory Data Analysis 3. Probability 4. T-based methods for continous variables 5. Power and sample size for t-based

More information

Generalized Linear Models (1/29/13)

Generalized Linear Models (1/29/13) STA613/CBB540: Statistical methods in computational biology Generalized Linear Models (1/29/13) Lecturer: Barbara Engelhardt Scribe: Yangxiaolu Cao When processing discrete data, two commonly used probability

More information

Advanced Statistical Methods: Beyond Linear Regression

Advanced Statistical Methods: Beyond Linear Regression Advanced Statistical Methods: Beyond Linear Regression John R. Stevens Utah State University Notes 3. Statistical Methods II Mathematics Educators Worshop 28 March 2009 1 http://www.stat.usu.edu/~jrstevens/pcmi

More information

Generalized Linear Models (GLZ)

Generalized Linear Models (GLZ) Generalized Linear Models (GLZ) Generalized Linear Models (GLZ) are an extension of the linear modeling process that allows models to be fit to data that follow probability distributions other than the

More information

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018

Statistics Boot Camp. Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 Statistics Boot Camp Dr. Stephanie Lane Institute for Defense Analyses DATAWorks 2018 March 21, 2018 Outline of boot camp Summarizing and simplifying data Point and interval estimation Foundations of statistical

More information

Logistic Regressions. Stat 430

Logistic Regressions. Stat 430 Logistic Regressions Stat 430 Final Project Final Project is, again, team based You will decide on a project - only constraint is: you are supposed to use techniques for a solution that are related to

More information

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006

Hypothesis Testing. Part I. James J. Heckman University of Chicago. Econ 312 This draft, April 20, 2006 Hypothesis Testing Part I James J. Heckman University of Chicago Econ 312 This draft, April 20, 2006 1 1 A Brief Review of Hypothesis Testing and Its Uses values and pure significance tests (R.A. Fisher)

More information

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data Ronald Heck Class Notes: Week 8 1 Class Notes: Week 8 Probit versus Logit Link Functions and Count Data This week we ll take up a couple of issues. The first is working with a probit link function. While

More information

BTRY 4090: Spring 2009 Theory of Statistics

BTRY 4090: Spring 2009 Theory of Statistics BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)

More information

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator UNIVERSITY OF TORONTO Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS Duration - 3 hours Aids Allowed: Calculator LAST NAME: FIRST NAME: STUDENT NUMBER: There are 27 pages

More information

Model Estimation Example

Model Estimation Example Ronald H. Heck 1 EDEP 606: Multivariate Methods (S2013) April 7, 2013 Model Estimation Example As we have moved through the course this semester, we have encountered the concept of model estimation. Discussions

More information

Course 4 Solutions November 2001 Exams

Course 4 Solutions November 2001 Exams Course 4 Solutions November 001 Exams November, 001 Society of Actuaries Question #1 From the Yule-Walker equations: ρ φ + ρφ 1 1 1. 1 1+ ρ ρφ φ Substituting the given quantities yields: 0.53 φ + 0.53φ

More information

R Hints for Chapter 10

R Hints for Chapter 10 R Hints for Chapter 10 The multiple logistic regression model assumes that the success probability p for a binomial random variable depends on independent variables or design variables x 1, x 2,, x k.

More information

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn A Handbook of Statistical Analyses Using R Brian S. Everitt and Torsten Hothorn CHAPTER 6 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, and Colonic Polyps

More information

You can specify the response in the form of a single variable or in the form of a ratio of two variables denoted events/trials.

You can specify the response in the form of a single variable or in the form of a ratio of two variables denoted events/trials. The GENMOD Procedure MODEL Statement MODEL response = < effects > < /options > ; MODEL events/trials = < effects > < /options > ; You can specify the response in the form of a single variable or in the

More information

Binary Regression. GH Chapter 5, ISL Chapter 4. January 31, 2017

Binary Regression. GH Chapter 5, ISL Chapter 4. January 31, 2017 Binary Regression GH Chapter 5, ISL Chapter 4 January 31, 2017 Seedling Survival Tropical rain forests have up to 300 species of trees per hectare, which leads to difficulties when studying processes which

More information

Biochip informatics-(i)

Biochip informatics-(i) Biochip informatics-(i) : biochip normalization & differential expression Ju Han Kim, M.D., Ph.D. SNUBI: SNUBiomedical Informatics http://www.snubi snubi.org/ Biochip Informatics - (I) Biochip basics Preprocessing

More information

Differential expression analysis for sequencing count data. Simon Anders

Differential expression analysis for sequencing count data. Simon Anders Differential expression analysis for sequencing count data Simon Anders RNA-Seq Count data in HTS RNA-Seq Tag-Seq Gene 13CDNA73 A2BP1 A2M A4GALT AAAS AACS AADACL1 [...] ChIP-Seq Bar-Seq... GliNS1 4 19

More information

Hypothesis Testing, Power, Sample Size and Confidence Intervals (Part 2)

Hypothesis Testing, Power, Sample Size and Confidence Intervals (Part 2) Hypothesis Testing, Power, Sample Size and Confidence Intervals (Part 2) B.H. Robbins Scholars Series June 23, 2010 1 / 29 Outline Z-test χ 2 -test Confidence Interval Sample size and power Relative effect

More information

Generalized Linear Models. Last time: Background & motivation for moving beyond linear

Generalized Linear Models. Last time: Background & motivation for moving beyond linear Generalized Linear Models Last time: Background & motivation for moving beyond linear regression - non-normal/non-linear cases, binary, categorical data Today s class: 1. Examples of count and ordered

More information

Linear Models and Empirical Bayes Methods for. Assessing Differential Expression in Microarray Experiments

Linear Models and Empirical Bayes Methods for. Assessing Differential Expression in Microarray Experiments Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments by Gordon K. Smyth (as interpreted by Aaron J. Baraff) STAT 572 Intro Talk April 10, 2014 Microarray

More information

A Generalized Linear Model for Binomial Response Data. Copyright c 2017 Dan Nettleton (Iowa State University) Statistics / 46

A Generalized Linear Model for Binomial Response Data. Copyright c 2017 Dan Nettleton (Iowa State University) Statistics / 46 A Generalized Linear Model for Binomial Response Data Copyright c 2017 Dan Nettleton (Iowa State University) Statistics 510 1 / 46 Now suppose that instead of a Bernoulli response, we have a binomial response

More information

Linear Mixed Models. One-way layout REML. Likelihood. Another perspective. Relationship to classical ideas. Drawbacks.

Linear Mixed Models. One-way layout REML. Likelihood. Another perspective. Relationship to classical ideas. Drawbacks. Linear Mixed Models One-way layout Y = Xβ + Zb + ɛ where X and Z are specified design matrices, β is a vector of fixed effect coefficients, b and ɛ are random, mean zero, Gaussian if needed. Usually think

More information

STAT 461/561- Assignments, Year 2015

STAT 461/561- Assignments, Year 2015 STAT 461/561- Assignments, Year 2015 This is the second set of assignment problems. When you hand in any problem, include the problem itself and its number. pdf are welcome. If so, use large fonts and

More information

Classification. Chapter Introduction. 6.2 The Bayes classifier

Classification. Chapter Introduction. 6.2 The Bayes classifier Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

More information

A Handbook of Statistical Analyses Using R 2nd Edition. Brian S. Everitt and Torsten Hothorn

A Handbook of Statistical Analyses Using R 2nd Edition. Brian S. Everitt and Torsten Hothorn A Handbook of Statistical Analyses Using R 2nd Edition Brian S. Everitt and Torsten Hothorn CHAPTER 7 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, Colonic

More information

Statistics 203: Introduction to Regression and Analysis of Variance Course review

Statistics 203: Introduction to Regression and Analysis of Variance Course review Statistics 203: Introduction to Regression and Analysis of Variance Course review Jonathan Taylor - p. 1/?? Today Review / overview of what we learned. - p. 2/?? General themes in regression models Specifying

More information

Normalization of metagenomic data A comprehensive evaluation of existing methods

Normalization of metagenomic data A comprehensive evaluation of existing methods MASTER S THESIS Normalization of metagenomic data A comprehensive evaluation of existing methods MIKAEL WALLROTH Department of Mathematical Sciences CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY OF GOTHENBURG

More information

Normalization and differential analysis of RNA-seq data

Normalization and differential analysis of RNA-seq data Normalization and differential analysis of RNA-seq data Nathalie Villa-Vialaneix INRA, Toulouse, MIAT (Mathématiques et Informatique Appliquées de Toulouse) nathalie.villa@toulouse.inra.fr http://www.nathalievilla.org

More information

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector

More information

Generalized Linear Models Introduction

Generalized Linear Models Introduction Generalized Linear Models Introduction Statistics 135 Autumn 2005 Copyright c 2005 by Mark E. Irwin Generalized Linear Models For many problems, standard linear regression approaches don t work. Sometimes,

More information

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

Review. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis Review Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1 / 22 Chapter 1: background Nominal, ordinal, interval data. Distributions: Poisson, binomial,

More information

McGill University. Faculty of Science. Department of Mathematics and Statistics. Statistics Part A Comprehensive Exam Methodology Paper

McGill University. Faculty of Science. Department of Mathematics and Statistics. Statistics Part A Comprehensive Exam Methodology Paper Student Name: ID: McGill University Faculty of Science Department of Mathematics and Statistics Statistics Part A Comprehensive Exam Methodology Paper Date: Friday, May 13, 2016 Time: 13:00 17:00 Instructions

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Binary Dependent Variables

Binary Dependent Variables Binary Dependent Variables In some cases the outcome of interest rather than one of the right hand side variables - is discrete rather than continuous Binary Dependent Variables In some cases the outcome

More information

MODELING COUNT DATA Joseph M. Hilbe

MODELING COUNT DATA Joseph M. Hilbe MODELING COUNT DATA Joseph M. Hilbe Arizona State University Count models are a subset of discrete response regression models. Count data are distributed as non-negative integers, are intrinsically heteroskedastic,

More information

Swarthmore Honors Exam 2012: Statistics

Swarthmore Honors Exam 2012: Statistics Swarthmore Honors Exam 2012: Statistics 1 Swarthmore Honors Exam 2012: Statistics John W. Emerson, Yale University NAME: Instructions: This is a closed-book three-hour exam having six questions. You may

More information

Stat 579: Generalized Linear Models and Extensions

Stat 579: Generalized Linear Models and Extensions Stat 579: Generalized Linear Models and Extensions Yan Lu Jan, 2018, week 3 1 / 67 Hypothesis tests Likelihood ratio tests Wald tests Score tests 2 / 67 Generalized Likelihood ratio tests Let Y = (Y 1,

More information

Statistics 572 Semester Review

Statistics 572 Semester Review Statistics 572 Semester Review Final Exam Information: The final exam is Friday, May 16, 10:05-12:05, in Social Science 6104. The format will be 8 True/False and explains questions (3 pts. each/ 24 pts.

More information

Institute of Actuaries of India

Institute of Actuaries of India Institute of Actuaries of India Subject CT3 Probability & Mathematical Statistics May 2011 Examinations INDICATIVE SOLUTION Introduction The indicative solution has been written by the Examiners with the

More information

Experimental Design and Statistical Methods. Workshop LOGISTIC REGRESSION. Jesús Piedrafita Arilla.

Experimental Design and Statistical Methods. Workshop LOGISTIC REGRESSION. Jesús Piedrafita Arilla. Experimental Design and Statistical Methods Workshop LOGISTIC REGRESSION Jesús Piedrafita Arilla jesus.piedrafita@uab.cat Departament de Ciència Animal i dels Aliments Items Logistic regression model Logit

More information

Sample Size Estimation for Studies of High-Dimensional Data

Sample Size Estimation for Studies of High-Dimensional Data Sample Size Estimation for Studies of High-Dimensional Data James J. Chen, Ph.D. National Center for Toxicological Research Food and Drug Administration June 3, 2009 China Medical University Taichung,

More information

Dispersion modeling for RNAseq differential analysis

Dispersion modeling for RNAseq differential analysis Dispersion modeling for RNAseq differential analysis E. Bonafede 1, F. Picard 2, S. Robin 3, C. Viroli 1 ( 1 ) univ. Bologna, ( 3 ) CNRS/univ. Lyon I, ( 3 ) INRA/AgroParisTech, Paris IBC, Victoria, July

More information

Robust statistics. Michael Love 7/10/2016

Robust statistics. Michael Love 7/10/2016 Robust statistics Michael Love 7/10/2016 Robust topics Median MAD Spearman Wilcoxon rank test Weighted least squares Cook's distance M-estimators Robust topics Median => middle MAD => spread Spearman =>

More information

Exam Applied Statistical Regression. Good Luck!

Exam Applied Statistical Regression. Good Luck! Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.

More information

Estimating the accuracy of a hypothesis Setting. Assume a binary classification setting

Estimating the accuracy of a hypothesis Setting. Assume a binary classification setting Estimating the accuracy of a hypothesis Setting Assume a binary classification setting Assume input/output pairs (x, y) are sampled from an unknown probability distribution D = p(x, y) Train a binary classifier

More information

COMPLEMENTARY LOG-LOG MODEL

COMPLEMENTARY LOG-LOG MODEL COMPLEMENTARY LOG-LOG MODEL Under the assumption of binary response, there are two alternatives to logit model: probit model and complementary-log-log model. They all follow the same form π ( x) =Φ ( α

More information

TMA 4275 Lifetime Analysis June 2004 Solution

TMA 4275 Lifetime Analysis June 2004 Solution TMA 4275 Lifetime Analysis June 2004 Solution Problem 1 a) Observation of the outcome is censored, if the time of the outcome is not known exactly and only the last time when it was observed being intact,

More information