Optimal design of microarray experiments

Size: px

Start display at page:

Download "Optimal design of microarray experiments"

Gilbert Douglas
6 years ago
Views:

1 University of Groningen ernst 7 June 2011

2 What is a cdna Microarray Experiment? GREEN (Cy3) Cancer Tissue mrna Mix tissues in equal amounts G G G R R R R R G R G G Hybridisation RED (Cy5) Normal Tissue mrna cdna microarray measures the relative abundance of RNA in two organic tissues. First, cdna labelled with G & R dyes is obtained from 2 conditions and mixed in equal proportions Second, the mixture is hybridized with the arrayed DNA (probes) Gene 1 Gene 2 Gene 3 Gene 4 Gene 5 Enlargement Third, competitive hybridization! Probe catches the matched cdna R g3 R R R g3 R R Gene 3 matches with normal tissue mrna -> high red signal Finally, G & R read-out for each spot (gene) by scanning

3 Why DOING a microarray experiment? Obvious, isn t it? To discover genes that are differentially expressed between tissues. a typical genomic profile of a particular disease. differences between phenotypic similar pathologies i.e., to find some differences.

4 Why DESIGNING a microarray experiment? Yeah, why? We want to be sure that the the differences we find are due to biological differences (e.g. different tissue, different diseases); not due to other experimental conditions (e.g. microarray batch, print-pin anomalies, etc.); not due to chance variation. Our exp t should maximize difference between biological conditions.

5 4 known design principles 1. Vary the conditions of interest, with replicates that are as many as possible; biological rather than technical replicates; as evenly distributed across the conditions of interest. 2. Restriction: Keep conditions not of interest constant. 3. Blocking: Keep track of uninteresting conditions that do vary. 4. Matching: Channels are similar units for effective comparisons.

6 Sources of Variation Microarrays are known as noisy experiments. Many artifacts are the result of the experimental conditions: microarray print batches lab experimenter hybridization mixture preparation hybridization conditions We follow an experiment in detail to determine sources of variation.

7 1. Start of microarray experiment Organic material is typically stored frozen to prevent decay. Preparing hybridization mixture takes 1-2 hours. Preparation of mixture is intensive, hands-on process. Scientist decides to prepare 2 of 20 arrays in parallel. 2 arrays = 4 channels. 4 separate tubes are prepared for the 4 channels. The following description applies to each of the 4 tubes.

8 2. Obtaining RNA Several plants of a particular condition are selected. Scientist uses pestle and mortar to grind it down to fine mush. Organic material is full of impurities (proteins): A membrane is used to fish out the RNA. Spectrophotometry to detect: RNA > 2 Clean Proteins Remaining mixture is concentrated using a centrifuge twice. At the end, 100 µg RNA is obtained.

9 3A. Obtaining stable cdna RNA is very unstable. RNA polymerase enzyme copies RNA into cdna, after sequences of Ts (poly-t primer) is added. quantity of 25µl of each nucleotide (A,T,G,C) is added magnesium and salt buffer is added.

10 3B. Preventing RNA folding RNA is heated up to 65 C to prevent it from folding onto itself.

11 3C. Obtaining stable, labelled cdna RNA can be made visible by copying a dye into the cdna. One adds C nucleotides with a dye molecule (Cy3 or Cy5) attached. The reverse transcriptase enzyme is added for an hour.

12 4. Preparing cdna mixture for hybridization Loose A, C, T, G bases are filtered out of cdna mixture. cdna mixture is dried down in centrifuge Original liquid replaced by special liquid ( hybridization buffer. ) Scientist combines 2 tubes together in equal quantities. (This results in two hybridization mixtures for two arrays)

13 5. Slide hybridization The microarray is washed with SSC and SDS. cdna mixture put on the slide & coverslip put on top in order to spread cdna mixture evenly get rid of air bubbles Slide is left in hybridization chamber (dark) at 42 C for hours.

14 6. Post-processing slide All non-hybridized (labelled) cdna is washed from the slide. Finally, the microarray is dried and is ready to be scanned.

15 Sources of variation in a microarray expt... Sources of variation in the hybridization process Different dye sensitivities. Inequalities in the application of mrna to the slide. Variations in the washing efficiencies of non-hybridized mrna off the slide. Other differences in hybridization parameters, such as: temperature, experimenter, time of the day,... Sources of variation in the mrna Differences in conditions. Differences between experimental subjects within the same covariate level. Differences between samples from the same subject. Variation in mrna extraction methods from original sample. Variations in reverse transcription. Differences in PCR amplification. Different labelling efficiencies. Sources of variation in the microarray production Print-pin anomalies Variations in printed probe quantities even with the same pin Chip batch variation (due to many sources of unknown variations) Differences in sequence length of the immobilized DNA Variations in chemical probe attachment levels to the slide Sources of variation in the scanning Different scanners; Different photo-multipliers or gain. Different spot-finding software; Different grid alignments.

16 A statistical toy model for gene expression log y gcbaxdpkr = θ gc +B b +A ba +L a (x)+d ad (θ gc )+P p +S ag +ɛ gck +η gckr. where θ gc = true expression for gene g under condition c B b = RNA Batch effect (experimenter, time of day, temperature) A ba = array effect (scanning level, pre/post-washing) L a (x) = location effect (chip, coverslip, washing) D ad (θ gc ) = dye effect (dye, unequal mixing of mixtures, labelling, intensity) P p = print pin effect S ag = spot effect (amount of DNA in the spot printed on slide) ɛ gck = biological variation for individual k η gckr = within-replicate-k variation of a technical replicate r

17 Normalization The proposed model: log y gcbaxdpkr = θ gc +B b +A ba +L a (x)+d ad (θ gc )+P p +S ag +ɛ gck +η gckr. It is hoped that the structural artifacts, B b, A ba, L a (x), D ad (θ gc ), P p can be eliminated by means of normalization. This leaves us a model for gene g at condition c on array a for the r th technical replicate of individual k, log y gcakr = θ gc + S ag + ɛ gck + η gcakr, where η gcakr effects. includes non-structural artifacts from nuisance

18 Five questions (one we answer four you will) From the model: Cy3: log x ikr = θ i + S + ɛ ik + η ikr Cy5: log x jkr = θ j + S + ɛ jk + η jkr, We define 5 microarray design problems: 1. How to estimate differential expression δ ij = θ i θ j? 2. How to deal with biological variation ɛ ik if we have too few replicates?... if we have too many replicates? 3. What if the θ possess some factorial structure? 4. What if we are interested in some functional form θ = θ(t, ϕ)?

19 Example: two different designs with 5 microarrays Question: Which design is more efficient in estimating the contrasts? Reference Design Loop Design R Note: each arrow stands for 1 microarray (green dye red dye).

20 Developing our intuition... The principal quantity of interest in microarray experiments is: the level of differential expression across the conditions δ ij Example: Let δ 13 = true log-difference in expression between 1 and 3. How good can we estimate δ 13 in both designs, assuming that each observation y has variance σ 2? NOTE: Variance σ 2 = σ 2 b + σ2 t consist both of biological and technical variation.

21 An estimate ˆδ 13 of δ 13 in a Reference Design Let y i = observed log-difference on microarray i. Reference Design 1 y 1 y 3 = (log C 1 log Ref) (log C 3 log Ref) = ˆδ 13, 5 4 R 3 2 so ) V (ˆδ13 = V (y 1 ) + V (y 2 ) = 2σ 2, where σ 2 is variance of single log-ratio V (y i ).

22 So, the loop design has a more precise Optimal estimate design of microarray of theexperiments log-ratio δ. An estimate ˆδ 13 of δ 13 in a Loop Design Let y i = observed log-difference on microarray i 5 Loop Design 1 2 y 1 y 2 = log C 1 C 2 log C 2 C 3 = ˆδ (1) 13 y 4 + y 3 y 5 = log C 1 C 5 log C 5 C 4 log C 4 C 3 = ˆδ (2) ˆδ 13 = 0.6 ˆδ (1) ˆδ (2) 13, so ) V (ˆδ13 = σ σ 2 = 1.2σ 2, where σ 2 is the variance of a single log-ratio

23 Two major questions Are there better designs than loop designs? = Sometimes Are loop designs always more efficient than reference designs? = Not always So how can we find optimal designs?

24 Some results from the literature... These are optimal even designs for all the contrasts when the number of arrays is two more than the number of conditions. (Kerr and Churchill 2001; exhaustive search) The problem with these and similar results not of practical interest. no indication how this result can be extended.

25 Schematic representation of a microarray exp t Loop Design y 1 y 2 y 3 y 4 y 5 = δ 21 δ 31 δ 41 δ 51 + ɛ Reference Design R 3 2 y 1 y 2 y 3 y 4 y 5 = δ 1r δ 2r δ 3r δ 4r δ 5r + ɛ

26 Outline of Near Optimal Design Algorithm Wit, Nobile and Khanin (2005) proposed the following algorithm: 1. Formalize design notation X, y = X δ + ɛ 2. Estimate the parameters of interest for design X, ˆδ ij X = (X t X ) 1 X t y, i > j 3. Summarize efficiency of parameter estimates, e.g. average V (ˆδ ij X ) 4. Search design X that attains highest efficiency, X = arg min X average V (ˆδ X ij ).

27 Design Optimality (1) Two ways to translate covariance matrix into univariate summary: A-optimality = minimizing Trace [ (X t X ) 1]. Minimizes the sum of the variances of the parameter estimates. D-optimality = minimizing Det [ (X t X ) 1]. Minimizes the generalized variance of the parameter estimates. PROBLEM: A-optimality depends on parametrization.

28 Design Optimality (2) The parameters of interest are ALL CONTRASTS δ = (δ ij ) i<j. Note: δ ij = δ i1 δ j1, so therefore: There is a matrix C, such that δ = C δ 21 δ 31 δ 41 δ 51 L-optimality: minimize Trace[C(X t X ) 1 C t ]. minimize average variance of all contrasts. This is equivalent to A-optimality with j δ j1 = 0 constraint.

29 Simulated Annealing Kirkpatrick et al. (1983) developed simple algorithm to minimize f : 1. Let X be the current state 2. Propose a new state X 3. Accept the new proposal with probability: { ( ) } f (X ) 1/T min 1, f (X, ) where T is the current temperature. 4. Slightly lower the temperature and return to 1. This procedure is valid if the proposals are symmetric; This procedure reaches the optimum if the chain is irreducible.

30 Simulated Annealing: Design Move 1 Single vertex move. Pick at random one comparison (between i and j). Pick at random a condition k. Replace comparison between i and j by comparison between i and k.

31 Simulated Annealing: Design Move 1 Single vertex move. Pick at random one comparison (between i and j). Pick at random a condition k. Replace comparison between i and j by comparison between i and k.

32 Simulated Annealing: Design Move 1 Single vertex move. Pick at random one comparison (between i and j). Pick at random a condition k. Replace comparison between i and j by comparison between i and k.

33 Simulated Annealing: Design Move 2 Single edge move. Pick at random one comparison (i and j). Pick at random two conditions, l and k. Replace comparison between i and j by comparison between l and k.

34 Simulated Annealing: Design Move 2 Single edge move. Pick at random one comparison (i and j). Pick at random two conditions, l and k. Replace comparison between i and j by comparison between l and k.

35 Simulated Annealing: Design Move 2 Single edge move. Pick at random one comparison (i and j). Pick at random two conditions, l and k. Replace comparison between i and j by comparison between l and k.

36 Simulated Annealing: Design Move 3 Balanced two-edge move. Pick at random two non-overlapping comparisons (i & j and k & l). Replace comparison i and j by comparison i and l. Replace comparison k and l by comparison k and j.

37 Simulated Annealing: Design Move 3 Balanced two-edge move. Pick at random two non-overlapping comparisons (i & j and k & l). Replace comparison i and j by comparison i and l. Replace comparison k and l by comparison k and j.

38 Simulated Annealing: Design Move 3 Balanced two-edge move. Pick at random two non-overlapping comparisons (i & j and k & l). Replace comparison i and j by comparison i and l. Replace comparison k and l by comparison k and j.

39 Loop designs are often A-optimal Stochastic Search of A Optimal Design for Contrasts with Simulated Annealing. Stochastic Search of A Optimal Design for Contrasts with Simulated Annealing. Acceptance rate = 36.4 %. Score := Tr[ Inv(X X) ] = 28. Nr of iterations = Acceptance rate = 37.1 %. Score := Tr[ Inv(X X) ] = 42. Nr of iterations = Nr of arrays = 7. Nr of conditions = 7 Nr of arrays = 8. Nr of conditions = 8 A-optimal designs for 7/8 conditions with 7/8 slides, respectively.

40 ...but not always Stochastic Search of A Optimal Design for Contrasts with Simulated Annealing. Acceptance rate = 36.3 %. Score := Tr[ Inv(X X) ] = Nr of iterations = Nr of arrays = 9. Nr of conditions = 9 A-optimal design with 9 conditions and 9 microarrays.

41 In practice: cyclic designs An interwoven loop design is defined by: number of conditions (HERE: 15) number of loops (HERE: 3) jump size for each loop (HERE: 1, 4, 6) However, cyclic designs are typically not A-optimal for p 17.

42 Advantages of an Interwoven Loop design A Optimality Score for Contrasts: Tr[ Inv(X X) ] = No of arrays = 90. No of conditions = 30 Easy to implement Compare left with right! Highly efficient Left is only 0.6% more efficient than Right Automatic dye balance conditions measured equally in Cy3/Cy5

43 Use od: optimal (interwoven loop) designs in R The function od(nt, ns, optimality =..., method = "loop") finds L-optimal/D-optimal interwoven loop/all designs for any number of conditions and any number of slides Can be obtained from smida library at: wite/book Some conclusions...

44 Dye Swap or Dye Balance? The simple model needs to be expanded for a possible dye effect: log y 1kr = θ 1 + S + +ɛ 1k + η 1kr log y 2kr = θ 2 + S + D + ɛ 2k + η 2kr, Dye swap designs have been proposed as way to control dye effect. HOWEVER, interwoven loop designs are more efficient! D A 5 cond. 10 arrays 89% 80% 6 cond. 12 arrays 87% 75% 7 cond. 14 arrays 85% 69% 8 cond. 16 arrays 82% 62% Relative D- and A-efficiency of dye-swap versus interwoven loop.

45 What other 4 design questions can be raised? 1. Too little biological material: technical replicates. 2. Too many biological material: pooling. 3. Factorial designs 4. Kinetic models We briefly describe these problems in what follows. You can work on them during the afternoon.

46 Case 1: not enough biological material... Typically, Wit et al. (2005) and others assumed: number of biological replicates exceeds the number of channels. Only then the above-mentioned optimal design -algorithm is applicable. What if number of channels exceeds number of biological replicates? In that case, we have to use technical replicates. New Question: How to assign the technical replicates to the arrays?

47 A simple practical example Consider an experiment with 3 conditions and 6 arrays. Each condition has 2 biological and 2 technical replicates. Dye swap an alternative Same design matrix, but different efficiency of parameter estimates.

48 Mixed effect models The mixed effect models can be written as: y = X θ + Z bio ɛ + η, where Z: the random effect design matrices ɛ random replication effect (of no interest) So, Var(y) = σ 2 b Z t b Z b + σ 2 t I.

49 Covariance-variance matrix Σ Σ = σb σt 2 I Σ = σb If only independent biological replicates, then Σ = (σ 2 b + σ2 t )I. That is the scenario in Wit et al. (2005). + σt 2 I

50 Generalized least squares The linear model is modelled by y = X δ + ɛ, where Var(ɛ) ρσ B + I is the variance structure of the residual. The generalized least squares estimates of δ: and so ˆδ = (X t Σ 1 X ) 1 X t Σ 1 y, V (ˆδ) (X t Σ 1 X ) 1. Under A-optimality we aim to minimize the f (X, Σ) = Tr(X t Σ 1 X ) 1

51 Case 2: Too many replicates Assume the following model for single gene & single condition: log y 1kr = θ 1 + S + ɛ k + η kr, where V (ɛ k ) = σɛ 2 : biological variation: associated with individual tissue V (η kr ) = ση: 2 technical variation: associated with microarray If p biological replicates are mixed before hybridization to array. Let x i be the expression of the ith pool, then V ( x i ) = σ2 ɛ p + σ2 η. Pool has less (biological) variation than each single tissue.

52 Pooling is not free: optimal pooling We want to minimize the estimation error: σ 2 ɛ np + σ2 η n, subject to not overrunning one s budget B = n p C s + n C a, C s = costs individual sample extraction C a = cost microarray (including dyes) Euler-Lagrange: Find the stationary points of h(n, p, λ) = σ2 ɛ np + σ2 η n λ (n p C s + n C a B).

53 Case 3: multi-factorial experiments (Simplified) example: Caetano et al. Genetics 2004, 168: Objective: Identify differentially expressed genes in pigs between 2 factors: Ovulation rate (2 levels: high, normal) Time after inducing luteal regression (4 levels: 2, 3, 4, 6 days) 2 4 = 8 factor combinations Available samples: 2 pigs per factor combination. Available microarrays: 12

54 Model Uncertainty One way to design the experiment: consider all factor combinations. This corresponds to a full factorial model. BUT possibly not all interactions are equally important: given large number of genes, some form of multi-factorial optimal design is natural; down-weight for higher order interactions.

55 Multi-factorial design optimality criterion Let Ey = X β, where β = (β 0,..., β p ) be the design and parametrization of a particular factorial model. Tsai, Gilmour and Mead (2000) suggested a Q-criterion, Q(X ) = 1 n 0 p p 1 a ii i=1 j=0 a 2 ij a ii a jj w ij, where {a ij } = X t X the information matrix, and w ij = no. of factorial submodels of X, which both include β i and β j. n 0 = total number of factorial submodels of X.

56 Q-criterion in microarray studies No intercept β 0 in the microarray log-expression ratio model. Slight modification of the Q-criterion: Q(X ) = 1 n 0 p p 1 a ii i=1 j=1 a 2 ij a ii a jj w ij. The old Q-criterion corresponds to including an explicit dye-effect.

57 Example: two-factor microarray experiment Let s focus on two lines of pigs whose ovary material is studied 2,3,4 and 6 days after inducing luteal regression. Because we study differential expressions, there is no intercept in the model, y ijk = α i + β j + (αβ) ij + ɛ ijk i = 1, 2, j = 1,..., 4 Not to favour any factor level, we choose the parametrization: α i = β j = (αβ) ij = 0. i j ij

58 Factorial submodels of two factor model α α + β α β β The total number of models n 0 = 4. The weight matrix is given by: w

59 Case 4: D-optimal designs for enzyme kinetic models Conditions = time Reaction A θ,λ B ODE representation: d[a] dt = θ[a] λ. with solution [A] = η(t, ψ) = { {1 (1 λ)θt} 1/(1 λ) λ 1 e θt λ = 1 Question: how to choose observations locations t 1,..., t k to optimize info about ψ = (θ, λ)?

60 Continuous designs and D-optimality The aim is to find an D-optimal design, { x1 x ξ = 2... x n w 1 w 2... w n }, whereby the x are the support points and the w the design weights. Solution independent of number of observations: w [0, 1]. Silvey (1980): optimal design is often concentrated on p design points, with weight 1/p (where p = # parameters). ξ is D-optimal, if it maximizes where F t = ( dη(t1,ψ) dθ dη(t 1,ψ) dλ det F t WF, dη(t 2,ψ) dθ dη(t 2,ψ) dλ So, optimal design depends on (guess) ψ 0. ) W = diag(1/2, 1/2)

61 Conclusions We have met several microarray design issues and concluded: 1. There are several design principles to obey when considering sources of variation of a microarray experiment. 2. carefully assigning samples to conditions can improve estimation: optimal designs are important in order to reduce cost/time interwoven loop designs are a good compromise 3. Further questions to consider: Technical replication Pooling Factorial designs kinetic modelling.

Topics on statistical design and analysis. of cdna microarray experiment

Topics on statistical design and analysis of cdna microarray experiment Ximin Zhu A Dissertation Submitted to the University of Glasgow for the degree of Doctor of Philosophy Department of Statistics May