Design and Analysis of Gene Expression Experiments


1 Design and Analysis of Gene Expression Experiments Guilherme J. M. Rosa Department of Animal Sciences Department of Biostatistics & Medical Informatics University of Wisconsin - Madison

2 OUTLINE
- Linear Models for Microarray Data
  - Log-ratio vs. intensities; mixed effects
  - Shrinkage estimation of variance components
  - Multiple testing
- Basic Principles of Experimental Design
  - Some concepts and definitions
  - Guidelines for designing experiments
- Microarray Experiments
  - Reference and loop designs
  - Biological and technical replication
  - Pooling mRNA samples
- Sample Size
  - Basics, Cost, FDR
- Additional Topics
  - Optimal designs, Genetical genomics, RNA-Seq

3 [Flowchart] Biological question (differentially expressed genes, sample class prediction, etc.) -> Experimental design -> Microarray experiment -> Image analysis -> Normalization -> Estimation / Testing / Clustering / Discrimination -> Biological verification and interpretation

4 Linear Models for Microarray Data

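5 Linear Models for Log Ratios (Yang and Speed, 2003)
Each array yields a log-ratio $y = \log(R/G) = \log R - \log G$.
- Single array comparing A and B: $E[y] = \beta = \mu_A - \mu_B$.
- Dye-swap pair of arrays: $E[(y_1, y_2)'] = (\beta, -\beta)'$.
- Three arrays involving A, B and a reference R: $E[(y_1, y_2, y_3)'] = X\beta$, with $\beta_1$ and $\beta_2$ defined as differences among $\mu_R$, $\mu_A$ and $\mu_B$ [design matrix not recoverable from the extraction].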

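6 Linear Models for Log Ratios (Yang and Speed, 2003)
Three samples A, B and C, with log-ratios $y_1, y_2, y_3$: $E[(y_1, y_2, y_3)'] = X\beta$, with $\beta_1 = \mu_A - \mu_B$ and $\beta_2 = \mu_A - \mu_C$.
Interest: d_AB, d_BC and d_CA = d_AB + d_BC
For gene g: $E[y_g] = X\beta_g$, $\mathrm{Var}[y_g] = \sigma_g^2 W$; estimation by (generalized) least squares or robust regression.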

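7 Linear Models for Log Ratios (Yang and Speed, 2003)
Four treatments T1 to T4; $\alpha_i$: intensity (log scale) on treatment i. Six arrays, each measuring $\log(R/G)$ for one pair of treatments:
$y_1: \alpha_2 - \alpha_1$, $y_2: \alpha_3 - \alpha_2$, $y_3: \alpha_4 - \alpha_3$, $y_4: \alpha_1 - \alpha_4$, $y_5: \alpha_3 - \alpha_1$, $y_6: \alpha_2 - \alpha_4$
Parameters: $\theta_1 = \alpha_2 - \alpha_1$, $\theta_2 = \alpha_3 - \alpha_1$, $\theta_3 = \alpha_4 - \alpha_1$
Model: $y = X\theta + \varepsilon$, with $\varepsilon_i \sim N(0, \sigma^2)$
Estimator: $\hat\theta = (X'X)^{-1}X'y$, with $\mathrm{Var}(\hat\theta) = \sigma^2 (X'X)^{-1}$
For example, $\hat\theta_1 = (2y_1 - y_2 - y_4 + y_5 + y_6)/4$.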
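A minimal numerical sketch of the least-squares estimator for the six-array design on this slide. The row order of the design matrix (arrays y1 to y6 in the order in which the log-ratios are listed) and the simulated intensities are assumptions made for illustration only.

```python
import numpy as np

# Parameters: theta1 = a2 - a1, theta2 = a3 - a1, theta3 = a4 - a1.
# Assumed row order: a2-a1, a3-a2, a4-a3, a1-a4, a3-a1, a2-a4.
X = np.array([
    [ 1,  0,  0],
    [-1,  1,  0],
    [ 0, -1,  1],
    [ 0,  0, -1],
    [ 0,  1,  0],
    [ 1,  0, -1],
], dtype=float)

XtX_inv = np.linalg.inv(X.T @ X)   # Var(theta_hat) in units of sigma^2
hat_rows = XtX_inv @ X.T           # theta_hat = hat_rows @ y

# Hypothetical log intensities for the four treatments
rng = np.random.default_rng(0)
alpha = np.array([8.0, 9.0, 7.5, 8.5])
theta = alpha[1:] - alpha[0]
y = X @ theta + rng.normal(scale=0.1, size=6)   # simulated log-ratios
theta_hat = hat_rows @ y

print(XtX_inv)            # diagonal: variances; off-diagonal: covariances
print(4 * hat_rows[0])    # weights of y1..y6 in 4 * theta1_hat
print(theta, theta_hat)
```

Under this row order each estimated contrast has variance 0.5 σ², half the variance of a single log-ratio, because every treatment pair is measured both directly and indirectly.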

8 ANOVA Models for Microarray Experiments
Sources of variation (fixed and random effects) to be considered:
- Dye
- Slide (array)
- Patch or print-tip within slide
- Spot within patch
- Print batch of slides
- Biological variability (individuals or pools)
- Gene
- Treatments (experimental groups)
- Interactions between factors
- Etc.

9 ANOVA Models (Kerr et al., 2000; Kerr and Churchill, 2001)
$\log(s_{adgt}) = y_{adgt} = \mu + A_a + D_d + AD_{ad} + G_g + TG_{tg} + AG_{ag} + DG_{dg} + \varepsilon_{adgt}$, with $\varepsilon_{adgt} \sim N(0, \sigma^2)$
- $s_{adgt}$: signal (pixels)
- Global effects: array ($A_a$), dye ($D_d$), array-by-dye interaction ($AD_{ad}$)
- Gene-specific effects: gene ($G_g$), gene-specific treatment effect ($TG_{tg}$), gene-specific array effect ($AG_{ag}$), gene-specific dye effect ($DG_{dg}$)

10 Two-Step Mixed Effects Models (Wolfinger et al., 2001)
Full model: $y_{adgt} = \mu + A_a + D_d + AD_{ad} + G_g + TG_{tg} + AG_{ag} + DG_{dg} + \varepsilon_{adgt}$
1st step, GLOBAL NORMALIZATION: $y_{adgt} = \mu + A_a + D_d + AD_{ad} + e_{adgt}$
2nd step, GENE MODELS, fitted to the predicted residuals $\hat{e}_{adgt} = y_{adgt} - \hat{y}_{adgt}$:
$\hat{e}_{adgt} = \mu_g + A_{ag} + D_{dg} + T_{tg} + \varepsilon_{adgt}$, with $\varepsilon_{adgt} \sim N(0, \sigma_g^2)$
where $T_{tg}$ are the gene-specific treatment effects.
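The two steps can be sketched on simulated data. This is a simplified illustration, not the authors' implementation: the dye-swap layout, the effect sizes and the use of plain cell means and group means (instead of a fitted mixed model) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_arrays, n_dyes, n_genes = 4, 2, 200

# Hypothetical layout: treatment (0/1) on each array/dye channel, dye-swapped.
trt = np.array([[0, 1], [1, 0], [0, 1], [1, 0]])

# Simulated log intensities with global array, dye and array-by-dye effects,
# gene effects, and a treatment effect of 1.0 for the first 20 genes only.
array_eff = rng.normal(0, 0.5, size=(n_arrays, 1, 1))
dye_eff = np.array([0.0, 0.3]).reshape(1, n_dyes, 1)
ad_eff = rng.normal(0, 0.2, size=(n_arrays, n_dyes, 1))
gene_eff = rng.normal(8, 1, size=(1, 1, n_genes))
trt_eff = np.zeros(n_genes)
trt_eff[:20] = 1.0
y = (array_eff + dye_eff + ad_eff + gene_eff
     + trt[:, :, None] * trt_eff[None, None, :]
     + rng.normal(0, 0.1, size=(n_arrays, n_dyes, n_genes)))

# Step 1 (global normalization): mu + A + D + AD is saturated in the
# array-by-dye cells, so the fitted values are the cell means over genes.
resid = y - y.mean(axis=2, keepdims=True)

# Step 2 (gene models): per-gene treatment difference estimated from the
# residuals. Gene-specific array and dye effects are left out of this sketch.
diff = np.array([
    resid[:, :, g][trt == 1].mean() - resid[:, :, g][trt == 0].mean()
    for g in range(n_genes)
])
print(diff[:20].mean(), diff[20:].mean())
```

Because step 1 removes each channel's mean over all genes, the step 2 estimates are relative to the average gene; the gap between the two printed means recovers the simulated effect.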

11 ANOVA vs. Two-Step Mixed Effects Models
Kerr and Churchill (2001):
$y = X\beta + \varepsilon = X_1\beta_1 + X_2\beta_2 + \varepsilon$, with $\varepsilon \sim N(0, R)$, $R = I_N\sigma_\varepsilon^2$, $N = \sum_{g=1}^{k} n_g$
Note: independence across genes; homogeneous variances; fixed effects model.
Wolfinger et al. (2001):
$y = X_1\beta_1 + e$, then $\hat{e}_g = X_g\beta_g + Z_g u_g + \varepsilon_g$, with $u_g \sim N(0, G_g)$, $G_g = I_q\sigma^2_{u(g)}$, $\varepsilon_g \sim N(0, R_g)$, $R_g = I_{n_g}\sigma^2_{\varepsilon(g)}$
Note: independent analyses across genes; gene-specific variances; additional random effects may be necessary (Rosa et al., 2005).

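12 Single Step Mixed Effects Models (Hoeschele and Li, 2005)
$y = X_1\beta_1 + Z_1u_1 + X_2\beta_2 + Z_2u_2 + \varepsilon$
$u_1 \sim N(0, G_1)$, $G_1 = I\sigma^2_{u_1}$
$u_2 \sim N(0, G_2)$, $G_2 = \bigoplus_{g=1}^{k} I_{n_a}\sigma^2_{u_2(g)}$
$\varepsilon \sim N(0, R)$, $R = \bigoplus_{g=1}^{k} I_{n_g}\sigma^2_{\varepsilon(g)}$
Note: independence across genes. Results are similar to the two-step implementation if: a) the data are balanced for each gene, b) $R = I\sigma^2$, and c) the number of genes is large.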

13 Shrinkage Estimators of Variance Components
- Complete shrinkage (homogeneous variances): $y = X\beta + Zu + \varepsilon$, with $u \sim N(0, I_q\sigma_u^2)$ and $\varepsilon \sim N(0, I_N\sigma_\varepsilon^2)$.
  Note: ranking of genes is based on fold-change (consequence: selection of more variable genes).
- No shrinkage at all (gene-specific variances): $y_g = X_g\beta_g + Z_g u_g + \varepsilon_g$, with $u_g \overset{ind}{\sim} N(0, I_q\sigma^2_{u(g)})$ and $\varepsilon_g \overset{ind}{\sim} N(0, I_{n(g)}\sigma^2_{\varepsilon(g)})$.
  Note: poor estimates of variance components, especially for smaller experiments.
- Intermediate methods: combining information across genes (e.g. regularized t-test, empirical Bayes approaches).

14 SHRINKAGE APPROACHES
- Efron et al. (2000): $t^* = \dfrac{\sqrt{n}\,M}{a + s}$, where a is a constant, e.g. the 90th percentile of the standard deviations. The denominator is something midway between a common and a gene-specific standard error.
- SAM: Significance Analysis of Microarrays
- Lönnstedt and Speed (2002): data from all genes are combined into estimates of the parameters of a prior distribution. These parameters are then combined at the gene level with the mean and standard deviation to form a statistic B (Bayes log posterior odds).
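A short sketch of the penalized t-statistic described above, assuming M is the mean log-ratio of a gene over n replicate arrays and s its standard deviation. The function name and the simulated data are hypothetical.

```python
import numpy as np

def penalized_t(log_ratios, a_percentile=90):
    """log_ratios: genes x replicates matrix. Returns (t_star, a)."""
    n = log_ratios.shape[1]
    mean = log_ratios.mean(axis=1)
    s = log_ratios.std(axis=1, ddof=1)
    a = np.percentile(s, a_percentile)   # penalty shared by all genes
    return np.sqrt(n) * mean / (a + s), a

rng = np.random.default_rng(2)
M = rng.normal(0.0, 0.3, size=(1000, 4))   # simulated null log-ratios
M[:50] += 1.0                              # 50 hypothetical DE genes
t_star, a = penalized_t(M)

# Ordinary gene-specific t-statistic, for comparison
t_plain = np.sqrt(4) * M.mean(axis=1) / M.std(axis=1, ddof=1)
print(a, np.abs(t_star).max(), np.abs(t_plain).max())
```

Adding a to every denominator damps the statistics of genes whose standard deviation happens to be very small, which is where the gene-specific t-statistic is least reliable.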

15 SHRINKAGE APPROACHES
- LIMMA: Linear Models for Microarray Data. Smyth GK (2004) Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology 3, No. 1, Article 3.
- MAANOVA: MicroArray ANalysis Of VAriance. Cui XG, Hwang JTG, Qiu J, Blades NJ and Churchill GA (2005) Improved statistical tests for differential gene expression by shrinking variance components. Biostatistics 6(1): 59-75.
- Variance Component Models (Wolfinger and Kass, 2000). Feng S, Wolfinger RD, Chu TM, Gibson GC and McGraw LA (2006) Empirical Bayes analysis of variance component models for microarray data. Journal of Agricultural, Biological and Environmental Statistics 11(2).


16 Graphical Display of the Baseball Data (Casella, 1985)
[Figure labels: Observed Values, EB Estimates, Final Values, Observed Mean]

17 Improved Statistical Tests for Differential Gene Expression by Shrinking Variance Components
Cui X et al. Biostatistics 6(1): 59-75, 2005
Test statistics (F statistics):
- F_S: shrinkage estimator of error variance
- F_1: gene-specific F test
- F_2: hybrid statistic using averages of the individual and pooled variances
- F_3: pooled-variance F statistic

18 Constructing F-like Statistics
General F statistic for a linear mixed model [equations not preserved in the extraction]:
- The linear mixed model relates the observations to the fixed and random effects through their design matrices, plus residuals.
- The variance-covariance matrix of the fixed-effect estimates (BLUE) and random-effect predictions (BLUP) can be estimated.
- Linear combinations of the fixed effects can be tested using an F statistic.

19 Constructing F-like Statistics
General linear mixed model for gene g [equation not preserved in the extraction]:
- F_1: gene-specific F; variance component estimates use data only from gene g
- F_3: uses the pooled variance estimator for each variance component
- F_2: uses the average of the gene-specific and pooled estimators for each component
- F_S: uses the shrinkage estimator given before as the variance component estimator for each gene
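The difference between F_1, F_2 and F_3 can be illustrated in a deliberately simplified setting: a two-group, fixed-effects comparison with a single error variance per gene, rather than the general mixed model of the slide. F_S is omitted because the shrinkage formula is not given here; the function name and data are hypothetical.

```python
import numpy as np

def f_statistics(y, group):
    """y: genes x samples; group: array of 0/1 labels. Returns F1, F2, F3."""
    g0, g1 = y[:, group == 0], y[:, group == 1]
    n0, n1 = g0.shape[1], g1.shape[1]
    diff = g1.mean(axis=1) - g0.mean(axis=1)
    # Gene-specific error variance (within-group sums of squares)
    ss = (((g0 - g0.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
          + ((g1 - g1.mean(axis=1, keepdims=True)) ** 2).sum(axis=1))
    s2_gene = ss / (n0 + n1 - 2)
    s2_pool = s2_gene.mean()               # pooled across all genes
    scale = 1.0 / n0 + 1.0 / n1
    f1 = diff ** 2 / (s2_gene * scale)                      # gene-specific
    f3 = diff ** 2 / (s2_pool * scale)                      # pooled variance
    f2 = diff ** 2 / (0.5 * (s2_gene + s2_pool) * scale)    # hybrid
    return f1, f2, f3

rng = np.random.default_rng(5)
y = rng.normal(size=(100, 8))
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])
f1, f2, f3 = f_statistics(y, group)
print(f1[:3], f2[:3], f3[:3])
```

By construction F_2 always lies between F_1 and F_3 for each gene, since its denominator is the average of the other two.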

20 MULTIPLE TESTING PROBLEM

21 HYPOTHESIS TESTING (Statistical Errors)

                 H0 is not rejected      H0 is rejected
H0 is true       No error (1 - α)        Type I error (α)
H0 is false      Type II error (β)       No error (1 - β)

α: significance level; 1 - β: power.
Standard approach:
- Specify an acceptable type I error rate (α)
- Seek tests that minimize the type II error rate (β), i.e., maximize power (1 - β)

22 THE PROBLEM OF MULTIPLE TESTING
Suppose you carry out 10 hypothesis tests at the 5% level (assume independent tests). The probability of declaring a particular test significant under its null hypothesis is 0.05, but the probability of declaring at least 1 of the 10 tests significant is 1 - 0.95^10 = 0.401. If you perform 20 hypothesis tests, this probability increases to 0.642.
- Typically thousands of genes are tested simultaneously
- Type I error rate (α)
Suppose: self-self hybridization with g genes, and α = 5% (for each test). Expected: g × 0.05 false positives.
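The arithmetic on this slide follows from independence of the tests: the chance of at least one false positive among m true nulls is 1 - (1 - α)^m. A minimal sketch; the function names and the gene count in the last line are illustrative assumptions.

```python
def prob_at_least_one_false_positive(alpha, m):
    """Probability of one or more rejections among m independent true nulls."""
    return 1.0 - (1.0 - alpha) ** m

def expected_false_positives(alpha, g):
    """Expected number of rejections among g true nulls tested at level alpha."""
    return alpha * g

for m in (1, 10, 20, 100):
    print(m, round(prob_at_least_one_false_positive(0.05, m), 3))

# Hypothetical self-self hybridization with 10000 genes
print(expected_false_positives(0.05, 10000))
```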

23 DISTRIBUTION OF P-VALUES
[Figure panels: under H0; mixture of H0 and Ha]

24 [Flowchart] Biological question (differentially expressed genes, sample class prediction, etc.) -> Experimental design -> Microarray experiment -> Image analysis -> Normalization -> Estimation / Testing / Clustering / Discrimination -> Biological verification and interpretation

25 "To consult the statistician after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of." Fisher (1938)

26 GUIDELINES FOR DESIGNING EXPERIMENTS (Montgomery, 2005)
1. Recognition of and statement of the problem
2. Selection of the response variable
3. Choice of factors, levels, and ranges
4. Choice of experimental design
5. Performing the experiment
6. Statistical analysis of the data
7. Conclusions and recommendations

27 DESIGN OF MICROARRAY EXPERIMENTS
[Diagram: treatments allocated to experimental units within Block 1, Block 2 and Block 3; the legend lists the treatments.]
ANOVA Table

Source of Variation   DF
Block                  2
Trt                    3
Residual               6
Total                 11

28 SINGLE CHANNEL MICROARRAY EXPERIMENT
[Diagram: Block 1, Block 2, Block 3]
ANOVA table for each gene

SV         DF
Block       2
Trt         3
Residual    6
Total      11

29 TWO COLOR MICROARRAY EXPERIMENT
[Diagram: Block 1, Block 2, Block 3]
The ANOVA (mixed effects model) will depend on both the preexisting and the microarray designs (Rosa et al., 2005):
y = µ + Dye + Block + Array(Block) + Trt + Sample(Trt) + ε
Additional effects can be included in the model, depending on the array structure (print-tip, replicated spots per clone, etc.) and on other systematic nuisance effects.

30 BASIC PRINCIPLES OF EXPERIMENTAL DESIGN
- Randomization: averaging out extraneous, non-controlled factors
- Replication: assessing experimental error
- Blocking: restriction on randomization

31 Fictitious Greenhouse Experiment
Goal: to compare total leaf area of plants in two experimental conditions (Control and Treatment)
Null hypothesis: similar leaf area in both groups
Single plant in each group, three leaves per plant
[Diagram: Control and Treatment plants]
t = 4.82 (p = 0.0085)

32 Fictitious Greenhouse Experiment
[Diagram labels: randomization within blocks; biological replication; blocking by side of greenhouse; group means $\bar{x}_C$, $\bar{x}_T$ and standard deviations $s_C$, $s_T$]

33 EXPERIMENTAL DESIGN PRINCIPLES IN GENE EXPRESSION PROFILING
- Randomization
- Biological vs. technical replication
- Blocking (multiple samples per slide/flow cell)

34 BIOLOGICAL VS. TECHNICAL REPLICATION
Consider an experiment where two groups are compared, and n subjects within each treatment are measured m times. Assume that the data generation process can be described by a model [equation not preserved in the extraction] with the following terms (Rosa et al., 2005):
- observed trait
- general constant
- treatment effect
- random effect of subjects within treatments, with its own variance
- residual term, with its own variance

35 BIOLOGICAL VS. TECHNICAL REPLICATION Analysis I: Ignoring the two different levels of replication

36 BIOLOGICAL VS. TECHNICAL REPLICATION Analysis II: Considering V v as fixed
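A small simulation makes the cost of ignoring the two levels of replication concrete. It is a sketch under assumed values: the standard deviations, the number of subjects (n) and measurements (m), and the use of two-sample t-tests are illustrative choices, not taken from the slides.

```python
import numpy as np
from scipy import stats

def type1_rates(n=4, m=3, sd_subject=1.0, sd_tech=0.5, n_sim=2000, seed=3):
    """Type I error at the 5% level, with no true treatment effect, for
    (a) a t-test treating all n*m measurements as independent, and
    (b) a t-test on the n subject means per group."""
    rng = np.random.default_rng(seed)
    naive = subject_level = 0
    for _ in range(n_sim):
        subj = rng.normal(0, sd_subject, size=(2, n, 1))      # biological
        y = subj + rng.normal(0, sd_tech, size=(2, n, m))     # technical
        p_naive = stats.ttest_ind(y[0].ravel(), y[1].ravel()).pvalue
        p_subj = stats.ttest_ind(y[0].mean(axis=1), y[1].mean(axis=1)).pvalue
        naive += p_naive < 0.05
        subject_level += p_subj < 0.05
    return naive / n_sim, subject_level / n_sim

rate_naive, rate_subject = type1_rates()
print(rate_naive, rate_subject)
```

Treating technical replicates as independent observations overstates the error degrees of freedom and understates the variance of the group means, so the naive test rejects far more often than the nominal 5%.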

37 Multiple Subjects Multiple Samples Multiple Slides Multiple Spots

39 Figure 1: Alternative reference designs with three treatments (A, B and C) and 12 arrays: a) Common reference design (CRD) with four replicates (subindexes); the reference sample (R) is the same in all arrays; b) Classical reference design (ClRD) with four replicates (subindexes) in 12 arrays; the reference sample (R) is replicated; and c) Replicated reference design (RRD), in which six replications of treatments B and C are hybridized together with independent replicates of a control treatment (A).

40 Figure 2. Relative efficiency of the replicated reference design (RRD) as compared to the common reference design (CRD), considering: a) the same number of replicates (r), and b) the same number of arrays (n). X-axis is the log2 of the variance ratio; curves correspond to different numbers of treatments (K).

42 BLOCKING (Restriction on Randomization)
- COMPLETELY RANDOMIZED DESIGN
- COMPLETE BLOCK DESIGN
- INCOMPLETE BLOCK DESIGNS

43 COMPLETELY RANDOMIZED DESIGN
- Example:
  - Three crop varieties (A, B and C) to be compared
  - Six homogeneous plots
[Layout: A B B C A C]

44 COMPLETE BLOCK DESIGN
- Example:
  - Three crop varieties (A, B and C) to be compared
  - Two blocks of three homogeneous plots each
[Layout: block 1 = A B C; block 2 = C A B]
- Randomization within blocks
- Block size equal to number of varieties

45 What if the block size is not equal to the number of treatments?
- Example:
  - Three crop varieties (A, B and C) to be compared
  - Three blocks of two homogeneous plots each
[Layout: blocks (A, R), (B, R), (C, R); A vs B is an indirect comparison]
- R: additional variety (reference)
- Indirect comparison between varieties A, B and C

46 What if the block size is not equal to the number of treatments?
- Example:
  - Three crop varieties (A, B and C) to be compared
  - Three blocks of two homogeneous plots each
[Layout: blocks (A, B), (B, C), (C, A); A vs B has one direct comparison and one indirect comparison through C]
- Two replications for each variety
- Smaller variances of variety differences (2/3 relative to the reference design)

47 REFERENCE AND LOOP STRUCTURES
Target samples: experimental groups (A to E); R: reference sample
- REFERENCE: A-R, B-R, C-R, D-R, E-R
- LOOP (CIRCULAR): A-B, B-C, C-D, D-E, E-A
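The precision of the two structures can be compared numerically. The sketch below assumes one log-ratio per array with independent errors of variance σ², and ignores dye effects; the function name is hypothetical.

```python
import numpy as np

def contrast_variance(pairs, a, b):
    """Variance (in units of sigma^2) of the least-squares estimate of the
    a - b difference, given the list of sample pairs hybridized together."""
    labels = sorted({s for pair in pairs for s in pair})
    X = np.zeros((len(pairs), len(labels)))
    for row, (s1, s2) in enumerate(pairs):
        X[row, labels.index(s1)] = 1.0
        X[row, labels.index(s2)] = -1.0
    c = np.zeros(len(labels))
    c[labels.index(a)] = 1.0
    c[labels.index(b)] = -1.0
    # Only differences are estimable, so use a generalized inverse of X'X
    return float(c @ np.linalg.pinv(X.T @ X) @ c)

reference = [("A", "R"), ("B", "R"), ("C", "R"), ("D", "R"), ("E", "R")]
loop = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "A")]

print(contrast_variance(reference, "A", "B"))   # any pair, always via R
print(contrast_variance(loop, "A", "B"))        # neighbours in the loop
print(contrast_variance(loop, "A", "C"))        # two steps apart
```

With the same five arrays, every comparison in the loop is more precise than in the reference design, but precision in the loop depends on how far apart the two groups sit.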

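48 USUAL GRAPHICAL REPRESENTATION
[Diagram: an arrow between two mRNA samples represents one hybridization (slide); one sample is labeled with Cy3 (green) and the other with Cy5 (red).]
Comparisons:
- Direct (more precise): A and B on the same slide, log2(A/B)
- Indirect (less precise): A and B each hybridized with C, log2(A/C) - log2(B/C)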

49 USUAL GRAPHICAL REPRESENTATION
[Diagrams: REFERENCE design, with A, B, C, D and E each connected to R; LOOP design, with A, B, C, D and E connected in a cycle.]

50 (Yang and Speed, 2002)

51 ALTERNATIVE DESIGNS
[Diagrams. Loop designs: connected loop, regular loop. Reference designs: classical reference (replicated reference samples R1 to R4), common reference (same reference R on all arrays).]

52 EXPERIMENTS WITH TWO GROUPS
[Diagrams: reference design with dye-swap (A1, A2, B1, B2, each against R); reference design with alternating dye (A1 to A4 and B1 to B4, each against R).]
Is alternating dye really necessary?

53 EXPERIMENTS WITH TWO GROUPS
[Diagram: single loop connecting A1 to A4 and B1 to B4.]
Notice: direct comparisons between the two groups

54 EXPERIMENTS WITH TWO GROUPS
[Diagrams: complete block design with dye-swap (A1 to A4, B1 to B4); complete block design arranged as a row-column design (A1 to A8, B1 to B8).]

55 EXPERIMENTS WITH THREE GROUPS
[Diagrams: reference design with dye-swap (A1, A2, B1, B2, C1, C2, each against R); reference design with alternating dye (A1 to A4, B1 to B4, C1 to C4, each against R).]

56 EXPERIMENTS WITH THREE GROUPS
[Diagram: loop connecting A1, B1, C1, A2, B2 and C2.]
Notice: still direct comparisons between each pair of groups

57 EXPERIMENTS WITH THREE GROUPS
[Diagrams: multiple connected loops (A1 to A4, B1 to B4, C1 to C4); multiple regular loops, equivalent to an incomplete block structure.]

58 EXPERIMENTS WITH MORE THAN THREE GROUPS
[Diagram: loop through groups A to E.]
Notice: may not have all possible direct comparisons; may want to include interwoven connections

59 A 2 x 2 x 2 A-optimal design allowing for biological replication
2 ages (A and a) x 2 sexes (B and b) x 2 strains (D and d) = 8 groups
2 biological replicates per group: 16 samples (pools), 32 arrays
Power considerations may dictate the need for another 16 samples with reverse fluor.
(Churchill and Oliver, 2001)

60 Sweetpotato Plant Response to Sweet Potato Virus Disease (McGregor et al., 2009)
Treatments:
- T1: uninfected plants
- T2: plants infected with SPFMV-RC alone
- T3: plants infected with SPFMV-C alone
- T4: plants infected with SPCSV alone
- T5: plants infected with SPFMV-RC and SPCSV together
- T6: plants infected with SPFMV-C and SPCSV together
Such a treatment structure can be viewed as a 2 x 3 factorial, as follows:

                     Virus 1
Virus 2 (SPCSV)      None    SPFMV-RC    SPFMV-C
No                   T1      T2          T3
Yes                  T4      T5          T6

61 Repeated measurements (longitudinal study) with 5 time points (Days After Infection, DAI)
[Table: treatments T1 to T6 by DAI]
3 biological replications per treatment x time combination

62 6 slides available; contrasts of interest: 1) treatments within time points, and 2) time points within treatment
Possible alternative:
[Diagram: treatments T1 to T6 by DAI]

63 EXPERIMENTS WITH MORE THAN THREE GROUPS
BALANCED INCOMPLETE BLOCK DESIGNS (BIBD)
- All pairs of treatments occur together the same number of times
- Let a: number of treatments, b: number of blocks, k: block size
- BIBD: $\binom{a}{k} = \dfrac{a!}{k!\,(a-k)!}$ blocks, with a different combination of treatments assigned to each block
- But balance can be obtained with fewer blocks as well
- Total number of observations = bk = ar, where r: number of observations per treatment
- Number of times each pair of treatments appears in the same block: $\lambda = \dfrac{r(k-1)}{a-1}$
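The BIBD bookkeeping can be checked with a few lines of arithmetic. This sketch covers only the design that uses every combination of k treatments as a block; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb

def unreduced_bibd(a, k):
    """Blocks (b), replicates per treatment (r) and pair concurrences (lam)
    for the BIBD that uses every k-subset of a treatments as one block."""
    b = comb(a, k)                  # number of blocks
    r = Fraction(b * k, a)          # from bk = ar
    lam = r * (k - 1) / (a - 1)     # lambda = r(k - 1)/(a - 1)
    return b, r, lam

# Two-colour arrays are blocks of size k = 2
for a in (3, 4, 5):
    print(a, unreduced_bibd(a, 2))
print(unreduced_bibd(5, 3))
```

For blocks of size 2 the pair count λ equals 1: every pair of treatments shares exactly one array, which is the all-pairs design.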

64 POOLING OF mRNA SAMPLES
Motivation:
- Not enough RNA available from each subject (unit)
- Limited number of arrays
[Diagrams ranking alternative pooling schemes ("better than") for a fixed number of subjects (r = 3) and for a fixed number of slides (n = 3).]
Issues: outliers; clustering analysis; variance of a pool of k subjects [formula not recoverable from the extraction].

65 SAMPLE SIZE
HYPOTHESIS TESTING

                 H0 is not rejected      H0 is rejected
H0 is true       No error (1 - α)        Type I error (α)
H0 is false      Type II error (β)       No error (1 - β)

α: significance level; 1 - β: power.
Standard approach:
- Specify an acceptable type I error rate (α)
- Seek tests that minimize the type II error rate (β), i.e., maximize power (1 - β)

66 SAMPLE SIZE
Minimum number of replications (r) per group: 3-5 (two groups) (reliable SD estimates, permutation test, error df, etc.)
Factorial 2 x 2, with r replicates in each of the cells A1B1, A1B2, A2B1 and A2B2. Degrees of freedom:

S.V.     (r=1)   (r=2)   (r=3)
A          1       1       1
B          1       1       1
A*B        1       1       1
Error      0       4       8
Total      3       7      11
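The degrees-of-freedom table follows directly from the number of observations, 4r. A small sketch (the function name is hypothetical):

```python
def factorial_2x2_df(r):
    """Degrees of freedom for a 2 x 2 factorial with r replicates per cell."""
    n = 4 * r                          # total number of observations
    df = {"A": 1, "B": 1, "A*B": 1}    # main effects and interaction
    df["Error"] = n - 4                # 4 cell means are estimated
    df["Total"] = n - 1
    return df

for r in (1, 2, 3):
    print(r, factorial_2x2_df(r))
```

With r = 1 there are no error degrees of freedom, so the interaction cannot be tested without further assumptions.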

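67 SAMPLE SIZE
Sample size calculation for experiments with two groups (n = total number of arrays):
REFERENCE: $n = \dfrac{4\,(z_{\alpha/2} + z_{\beta})^2}{(\delta/\sigma)^2}$
BALANCED BLOCK: $n = \dfrac{(z_{\alpha/2} + z_{\beta})^2}{(\delta/\tau)^2}$ (in general, τ > σ)
where δ is the log(FC) to be detected and σ and τ are the corresponding SDs.
Example (Dobbin and Simon, 2003) [numeric values partly lost in extraction]: δ=, α=., β=.5: σ ≈ .5, n = 3 (i.e. 5A + 5B + 3R); τ ≈ .67, n = 7 (i.e. 7A + 7B)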
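The two formulas can be evaluated with the standard library alone. The inputs below are illustrative assumptions and are not a reconstruction of the slide's example; the function names are hypothetical, and the formulas use normal quantiles only, with no small-sample correction.

```python
from statistics import NormalDist

def arrays_reference(delta, sigma, alpha, beta):
    """n = 4 (z_{alpha/2} + z_beta)^2 / (delta / sigma)^2"""
    z = NormalDist().inv_cdf
    return 4 * (z(1 - alpha / 2) + z(1 - beta)) ** 2 / (delta / sigma) ** 2

def arrays_balanced_block(delta, tau, alpha, beta):
    """n = (z_{alpha/2} + z_beta)^2 / (delta / tau)^2"""
    z = NormalDist().inv_cdf
    return (z(1 - alpha / 2) + z(1 - beta)) ** 2 / (delta / tau) ** 2

# Assumed inputs: twofold change on the log2 scale, stringent alpha
n_ref = arrays_reference(delta=1.0, sigma=0.5, alpha=0.001, beta=0.05)
n_blk = arrays_balanced_block(delta=1.0, tau=0.67, alpha=0.001, beta=0.05)
print(n_ref, n_blk)   # round up to whole arrays in practice
```

Even though τ is larger than σ, the balanced block design needs fewer arrays here because it avoids the factor of 4 that comes from spending half of every array on the reference sample.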

68 SAMPLE SIZE
COST (Cui and Churchill, 2003): m technical replicates (measurements) on n pools of k subjects each.
[MSE expression not recoverable from the extraction; it combines the biological variance $\sigma_B^2$, the technical variance $\sigma_A^2$ and the residual variance $\sigma_e^2$.]
$\text{Cost} = n\,k\,C_U + n\,m\,C_M$, where $C_U$ is the cost of each experimental unit and $C_M$ is the cost of each measurement.

69 Set-up: testing m null hypotheses H_j (j = 1, ..., m), of which m0 are true and m1 are false; R is the number of rejected H0.

                 No. H0 not rejected   No. H0 rejected   Total
No. true H0      A                     B                 m0
No. false H0     C                     D                 m1
Total            m - R                 R                 m

B: number of false positives. R is an observable quantity; m is a known quantity.
Family-wise error rate (FWER): FWER = Pr(B ≥ 1) = 1 - Pr(B = 0)
False discovery rate (FDR): FDR = E[B/R | R > 0] Pr(R > 0)
Positive FDR (pFDR): Storey (2002)

70 SAMPLE SIZE
In the context of multiple testing, Gadbury et al. (2004) define:
$TP = \dfrac{D}{C + D}$, $TN = \dfrac{A}{A + B}$, $EDR = \dfrac{D}{B + D}$
[Figure residue: n, n*, τ, p-value, t, t*, p-value*, TP, TN, EDR]
Other methods (FDR-based): Muller et al. (2004), Hu et al. (2005) and Jung (2005)

71 FDR
Liu and Hwang (2007): prior specification of:
- Maximal FDR (α)
- Required power (1 - β)
- Proportion of non-differentially expressed genes (π0)
- Noncentrality parameter (λ)
- Unit variance (σ²)
Then a constant c is chosen to satisfy:
$\dfrac{1 - F_{k;d(n)}(c)}{1 - F_{k;d(n);\lambda}(c)} = \dfrac{\alpha(1 - \pi_0)}{(1 - \alpha)\,\pi_0}$
where
- $F_{k;d(n)}$: cdf of the F distribution with k and d(n) degrees of freedom
- $F_{k;d(n);\lambda}$: cdf of the noncentral F distribution with noncentrality parameter λ
- k: number of treatment contrasts of interest
- d(n): residual degrees of freedom
- n: number of sets used (a set is a partial treatment design which will be replicated to produce the complete design)
Using this c, the expected power is $1 - \beta = 1 - F_{k;d(n);\lambda}(c)$, and n can be chosen to achieve the required expected power.

72 GENERAL STATISTICAL PRINCIPLES OF THE DESIGN OF EXPERIMENTS
General procedure for choosing a good design (Bailey 1981, 1998 and Mead 1990):
1. Choice of a set of treatments (groups to compare) given the objectives of the experiment.
2. Identification of the experimental units and choice of blocking structure based on the expected pattern of variability among them.
3. Identification of any restrictions on which treatments can be applied to units.
4. Construction of the design using a combinatorial, algorithmic or ad hoc method.

73 OPTIMAL DESIGNS
- Model: y = Xθ + e
- Information on θ is proportional to X'X; thus, some suitable function of this information should be optimized.
- Kiefer (1959): alphabetic optimality criteria

74 $\mathrm{Var}[\hat\theta] = (X'X)^{-1}\sigma^2 = M^{-1}\sigma^2$, with $M = X'X$
- A: minimize the average of the variances (trace of $M^{-1}$)
- D: minimize the hypervolume of the confidence region for θ (determinant of $M^{-1}$)
- E: minimize the variance of the worst estimated estimable function of θ (first, i.e. largest, eigenvalue of $M^{-1}$)
- Others: subset $\theta_s$ of θ: $A_s$ or $D_s$; decision theoretic approach: Lindley (1972), Tiao and Afonja (1978), Chaloner and Verdinelli (1995)
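The three criteria can be computed for any full-rank design matrix. The sketch below compares two hypothetical three-array designs for treatments A, B and C, parameterized by θ1 = μ_A - μ_C and θ2 = μ_B - μ_C; the designs and names are assumptions for illustration, and dye effects are ignored.

```python
import numpy as np

def alphabetic_criteria(X):
    """A-, D- and E-criteria (smaller is better) computed from M^{-1}."""
    M_inv = np.linalg.inv(X.T @ X)
    return {
        "A": float(np.trace(M_inv)),
        "D": float(np.linalg.det(M_inv)),
        "E": float(np.linalg.eigvalsh(M_inv).max()),
    }

# Loop: arrays A-B, B-C, C-A
loop = np.array([[1, -1], [0, 1], [-1, 0]], dtype=float)
# C used as a reference: arrays A-C, B-C, and a second A-C
vs_c = np.array([[1, 0], [0, 1], [1, 0]], dtype=float)

crit_loop = alphabetic_criteria(loop)
crit_vs_c = alphabetic_criteria(vs_c)
print(crit_loop)
print(crit_vs_c)
```

For these two designs the loop is better on the A and D criteria and ties on E. The A and E values depend on which contrasts are chosen as parameters, so designs should be compared under the parameterization that matches the contrasts of interest.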

75 A-optimal designs (Kerr and Churchill, 2001)
[Design diagrams for n = v + 2 and n = 2v]

76 (Yang and Speed, 2002)

77 (Yang and Speed, 2002)

78 DYE-SWAP vs. DYE BALANCE
Wit et al. (2005)


79 A Design for Genetically Related Treatments (Example with Hierarchical Mating)
[Diagram: half-sib families of size 3]

80 RNA-SEQ SPECIFIC EFFECTS AND BLOCKING
Briefings in Bioinformatics 12(3): 280-287, 2011.
- Nuisance factors: processing date, technician, and reagent batch
- Others: library preparation effect; variation between flow cells; variation between lanes within a flow cell

81 SEQUENCING DEPTH
- Sequencing depth vs. sample size
- Power law: Bashir et al. (2010): >90% of transcripts sampled with 10^6 sequence reads
- General recommendation: run a small pilot sequencing experiment to estimate the distribution of all transcripts


82 SAMPLE SIZE CALCULATION
Technical replicates, with negligible library preparation and lane effects (Poisson model)
$X_{ijk}$: count from treatment i (i = 1, 2), replicate (lane) j (j = 1, 2, ..., $n_i$), for gene k.
[Model equation not preserved in the extraction; its terms are annotated as "total number of mapped reads" and "transcript frequency for this gene".]

83 SAMPLE SIZE CALCULATION
To obtain a power (1 - β) under the alternative hypothesis at the level of significance α, using a Wald-type Z-statistic, a relation between the sample size and the normal quantiles corresponding to α and β is obtained [equation not preserved in the extraction].

84 SAMPLE SIZE CALCULATION
Approximate sample sizes [equation not preserved in the extraction].
Note: values of α, β and ρ are pre-selected; values of d and of the remaining parameter [symbol lost in extraction] can be determined using preliminary data.
Warning: the sample size calculation based on the Poisson model above is the most optimistic scenario.

85 SAMPLE SIZE CALCULATION
Calculations with biological replication: over-dispersion issue, Negative Binomial distributions
No analytical derivation; Monte Carlo simulation is necessary
The calculation is discussed for a single gene. Potential alternatives in the context of multiple testing:
- Obtain sample sizes for one gene and then determine the overall sample size based on the overall average power
- Set the effect size, the number of non-differential genes, and the expected number of false positives
- Model P-values as a mixture of distributions from genes that are not differentially expressed and genes that are differentially expressed
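A Monte Carlo power calculation for a single gene with Negative Binomial counts might look like the sketch below. Everything specific in it is an assumption for illustration: the means, the dispersion, and in particular the test (a Welch t-test on log-transformed counts), which stands in for whatever count-based test would be used in a real analysis.

```python
import numpy as np
from scipy import stats

def nb_power(n, mu1, mu2, dispersion, alpha=0.05, n_sim=500, seed=4):
    """Estimated power for n biological replicates per group.
    Counts are Negative Binomial with Var = mu + dispersion * mu^2."""
    rng = np.random.default_rng(seed)
    r = 1.0 / dispersion                      # NB "size" parameter
    hits = 0
    for _ in range(n_sim):
        x1 = rng.negative_binomial(r, r / (r + mu1), size=n)
        x2 = rng.negative_binomial(r, r / (r + mu2), size=n)
        p = stats.ttest_ind(np.log(x1 + 1.0), np.log(x2 + 1.0),
                            equal_var=False).pvalue
        hits += p < alpha
    return hits / n_sim

# Hypothetical gene: twofold change, dispersion 0.2
power_small = nb_power(n=3, mu1=100, mu2=200, dispersion=0.2)
power_large = nb_power(n=20, mu1=100, mu2=200, dispersion=0.2)
print(power_small, power_large)
```

Repeating the call over a grid of n gives a power curve from which the smallest adequate sample size can be read off; extending it to many genes requires one of the multiple-testing strategies listed above.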

86 VALIDATION
Validation with independent samples and (not necessarily) using a different technology, e.g. qRT-PCR
