Design of Microarray Experiments Xiangqin Cui
Experimental design Experimental design: is a term used about efficient methods for planning the collection of data, in order to obtain the maximum amount of information for the least amount of work. Anyone collecting and analyzing data, be it in the lab, the field or the production plant, can benefit from knowledge about experimental design. http://www.stat.sdu.dk/matstat/design/index.html
Experimental Design Principles Randomization Replication Blocking Use of factorial experiments instead of the one-factor-at-a-time methods. Orthogonality
Randomization The experimental treatments are assigned to the experimental units (subjects) in a random fashion. It helps to eliminate effect of "lurking variables", uncontrolled factors which might vary over the length of the experiment.
Commonly Used Randomization Methods Number the subjects to be randomized and then randomly draw the numbers using paper pieces in a hat or computer random number generator. 1 2 3 4 5 6 Hormone treatment : (1,3,4); (1,2,6) Control : (2,5,6); (3,4,5) Plan 1 plan 2
Randomization in Microarray Experiments Randomize arrays in respect to samples Randomize samples in respect to treatments Randomize the order of performing the labeling, hybridizations, scanning
Replication Replication is repeating the creation of a phenomenon, so that the variability associated with the phenomenon can be estimated. Replications should not be confused with repeated measurements which refer to literally taking several measurements of a single occurrence of a phenomenon.
Replication in microarray experiments What to replicate? Biological replicates (replicates at the experimental unit level, e.g. mouse, plant, pot of plants ) Experimental unit is the unit that the experiment treatment or condition is directly applied to, e.g. a plant if hormone is sprayed to individual plants; a pot of seedlings if different fertilizers are applied to different pots. Technical replicates Any replicates below the experimental unit, e.g. different leaves from the same plant sprayed with one hormone level; different seedlings from the same pot; Different aliquots of the same RNA extraction; multiple arrays hybridized to the same RNA; multiple spots on the same array.
How Many Replicates Power: the probability for a real positive to be identified It depends on: 1) The size of difference (log ratio) to be detected The bigger the difference, the higher the power. 2) The sample size The larger the sample size, the higher the power. 3) The error variance of the treatment The bigger the error variance, the lower the power.
How Many Replicates * Minimum number of replication (r) per group: 3 5 (two groups) (reliable variance estimates, permutation test, df error, etc.) Factorial 2 x 2 A1 B1 r B2 r A2 r r S.V. A B A*B Error (r=1) 1 1 1 0 (r=2) 1 1 1 4 (r=3) 1 1 1 8 Total 3 7 11
* Sample size calculation for experiments with two groups REFERENCE: n = # arrays (total) 4 ( z + z ) (1 α / 2) ( δ / σ) log(fc) (1 β) 2 2 SD BALANCED BLOCK: n = ( z + z ) (1 α / 2) ( δ / τ) (1 β) 2 2 (in general, τ > σ) Example (Dobbin and Simon, 2003): δ=1, α=0.001, β=0.05: σ 0.50 n = 30 (i.e. 15A + 15B + 30R) τ 0.67 n = 17 (i.e. 17A + 17B)
Error variance (EV) of the estimated treatment effect: TrtA TrtB Reference Design M1 M2 M3 M4 EV = σ m 2 M 2 σ e + mn R R R R Error variance of the estimated treatment effect: Approximately EV = σ e 2 2mn( σ 2 2 σ M σ + e m mn 4 A + σ m : number of mice per treatment n : number of array pairs per mouse 2 e )
Resource allocation: 2 2 σ M σ e EV = + m mn Cost = mc + m M nc A m n C M C A mice / trt array pairs / mouse cost / mouse cost / array pair The optimum number of array pairs per mouse: n σ 2 e = 2 σ M C C M A
Examples for resource allocation Using variance components estimated from kidney in Project Normal data. r = 1 ( no replicated spots on array ) Reference design Mouse price Array price/pair # of array pairs per mouse $15 $600 1 $300 $600 1 $1500 $600 2 More efficient array level designs, such as direct comparisons and loop designs, can reduce the optimum number of arrays per mouse.
Pooling Samples Pooling can reduce the biological variance but not the technical variances. The biological variance will be replaced by: σ 2 1 pool = α k : pool size k σ 2 M α : constant for the effect of pooling. 0 < α < 1 α = 0, pooling has no effect. α = 1, pooling has maximum effect. Note: More pools with fewer arrays per pool is better than fewer pools with more arrays per pools when fixed number of arrays and mice are to be used.
Power Increase to Detect 2 fold change by Pooling ( Pool size k = 3, α = 1 ) 2 pools / trt 4 pools / trt 6 pools / trt 8 pools / trt 2 mice / trt 4 mice / trt 6 mice / trt 8 mice / trt Significance level: 0.05 after Bonferroni correction
PowerAtlas It is an online tool to calculate sample size based on similar data set in GEO or user provided pilot data. http://www.poweratlas.org/ Page et al., BMC Bioinformatics. 2006 Feb 22;7:84. Gadbury GL, et al. Stat Meth Med Res 13: 325-338, 2004.
EDR is the proportion of genes that are truly differentially expressed that will be called significant at the significance level of 0.05. This is analogous to average power. TP is the proportion of genes called 'significant' that are truly differentially expressed. TN is the proportion of genes called 'not significant' that are truly not differentially expressed between the conditions.
Blocking Some identified uninteresting but varying factors can be controlled through blocking. Completely randomized design Complete block design Incompletely block designs
Completely Randomized Design There is no blocking Example Compare two hormone treatments (trt and control) using 6 Arabidopsis plants. 1 2 3 4 5 6 Hormone trt: (1,3,4); (1,2,6) Control : (2,5,6); (3,4,5) Note: Design with one-color microarrays is often completely randomized design.
Complete Block Design There is blocking and the block size is equal to the number of treatments. Example: Compare two hormone treatments (trt and control) using 6 Arabidopsis plants. For some reason plant 1 and 2 are taller, plant 5 and 6 are thinner. 1 2 3 4 5 6 Hormone treatment: (1,4,5); (1,3,6) Control : (2,3,6); (2,4,5) Randomization within blocks
Incomplete Block Design There is blocking and the block size is smaller than the number of treatments. Example: Compare three hormone treatments (hormone level 1, hormone level 2, and control) using 6 Arabidopsis plants. For some reason plant 1 and 2 are taller, plant 5 and 6 are thinner. 1 2 3 4 5 6 Hormone level1: (1,4); (2,4) Hormone level2: (2,5); (1,6) Control : (3,6); (3.5) Randomization within blocks
Designs for two-color microarrays Experimental design for one color microarray array is straight forward with out the required blocking at the array level. For two color microarrays, we need to pair the samples and label then with Cy3 and Cy5 for hybridization to one array. How do we pair?
Blocking in microarray experiments In two-color platform, the arrays vary due to the loading of DNA quantity, spot morphology, hybridization condition, scanning setting It is treated as a block of size 2, the samples are compared within each array. If there are two lots of chips in the experiment and there is large variation across chip lots. We can treat chip lots as a blocking factor. Example: compare three samples: A, B, C A B B C C A Block (array) 1 Block (array) 2 Block (array) 3
A1 Cy5 Cy3 B1
Dye-swap A1 B1 Dye swapping Separates the dye effect from the sample effect. Results are less sensitive to the data transformation processes Kerr and Churchill(2001) Biostatistics, 2:183-201.
Replicated Dye-Swap A1 B1 A2 B2
Reference design All samples are compared to a single reference sample. Single reference Double reference A B C A B C R R
Example Microarray Experiment comparing 3 treatments T1 T2 T3 M1 M2 M3 M4 M5 M6 M7 M8 M9 M10 M11 M12 M13 M14 M15 R R R R R R R R R R R R R R R Note: Arrows are used to represent a two-color microarray. The arrow head represents the red channel and the tail represents the green channel. T1 to T3, the three treatments; M1 to M15, the 15 plants; R, reference sample.
Loop Design A D B C
Basic types of loop designs Single loop Double loop A B A B C C
Complex loop designs Two-factor factorial design at the high level Pera I DBA High fat A B C Low fat E F D A1 B1 A2 B2 C1 F1 C2 F2 D1 E1 D2 E2
Reference or Loop Reference design: It has lower overall efficiency, but it has equal efficiency for all sample comparisons, and is easy to extend. Loop design: higher overall efficiency, harder to extend, unequal efficiency for all comparisons,
Comparing the efficiency of reference versus loop designs Yang and Speed (2002) Nat. Rev. Genet 3:579
Alternative Designs Loop: A 1 B 1 Connected Loop A B C 1 A A 1 B 2 2 B 1 C Regular Loop C 1 C 2 Reference: A A B B Classical Reference A 1 A 2 B 1 B 2 A 1 A 2 B 1 B 2 R R R R R 1 R 2 R 3 R 4 R 1 R 1 R 1 R 1 Common Reference
Experiments With Two Groups A 1 A 2 B 1 B 2 R R R R Reference with dye-swap A 1 A 2 A 3 B 1 B 2 B 3 A 4 B 4 Reference with R R R R R R R R alternating dye Is alternating dye really necessary?
Experiments with Two Groups (Direct comparison) A 1 A 2 B 1 B 2 A 3 A 4 B 3 B 4 Complete block with dye-swap Complete block design; Row-column design A 1 A 2 A 3 A 4 A 5 A 6 A 7 A 8 B 1 B 2 B 3 B 4 B 5 B 6 B 7 B 8
Experiments with Three Groups A 1 A 2 B 1 B 2 C 1 R R R R R C 2 R Reference with dye-swap Reference with alternating dye A 1 A 2 A 3 A 4 B 1 B 2 B 3 B 4 C 1 C 2 C 3 C 4 R R R R R R R R R R R R
Experiments With Three Groups Multiple connected loops A 1 B 1 A 2 B 2 A 3 B 3 A 4 B 4 C 1 C 2 C 3 C 4 Multiple regular loops = Incomplete block structure A A 1 B 2 2 B A 1 1 B 1 = C 1 C 1 C 2 B 2 C 2 A 2
Pooling mice Pooling can reduce the mouse variance but not the technical variances. The mouse variance will be replaced by: σ 2 1 pool = α k : pool size k σ 2 M α : constant for the effect of pooling. 0 < α < 1 α = 0, pooling has no effect. α = 1, pooling has maximum effect. Note: More pools with fewer arrays per pool is better than fewer pools with more arrays per pools when fixed number of arrays and mice are to be used.
Power Increase to Detect 2 fold change by Pooling ( Pool size k = 3, α = 1 ) 2 pools / trt 4 pools / trt 6 pools / trt 8 pools / trt 2 mice / trt 4 mice / trt 6 mice / trt 8 mice / trt Significance level: 0.05 after Bonferroni correction
Design Principle Continues Use of factorial experiments instead of the one-factor-at-a-time methods. Orthogonality: each level of one factor has all levels of the other factor. Example: To compare two treatments (T1, T2) and two strains (S1, S2) S1 S2 T1 T1S1 T1S2 T2 T2S1 T2S2
References Churchill GA. Fundamentals of experimental design for cdna microarrays. Nature Genet. 32: 490-495, 2002. Cui X and Churchill GA. How many mice and how many arrays? Replication in mouse cdna microarray experiments, in "Methods of Microarray Data Analysis III", Edited by KF Johnson and SM Lin. Kluwer Academic Publishers, Norwell, MA. pp 139-154, 2003. Gadbury GL, et al. Power and sample size estimation in high dimensional biology. Stat Meth Med Res 13: 325-338, 2004. Kerr MK. Design considerations for efficient and effective microarray studies. Biometrics 59: 822-828, 2003. Kerr MK and Churchill GA. Statistical design and the analysis of gene expression microarray data. Genet. Res. 77: 123-128, 2001. Kuehl RO. Design of experiments: statistical principles of research design and analysis, 2 nd ed., 1994, (Brooks/cole) Duxbry Press, Pacific Grove, CA. Page GP et al. The PowerAtlas: a power and sample size atlas for microarray experimental design and research. BMC Bioinformatics. 2006 Feb 22;7:84. Rosa GJM, et al. Reassessing design and analysis of two-colour microarray experiments using mixed effects models. Comp. Funct. Genomics 6: 123-131. 2005. Wit E, et al. Near-optimal designs for dual channel microarray studies. Appl. Statist. 54: 817-830, 2005. Yand YH and Speed T. Design issues for cdna microarray experiments. Nat. Rev. Genet. 3: 570-588, 2002.