OPTIMAL DESIGN OF EXPERIMENTS FOR EMERGING BIOLOGICAL AND COMPUTATIONAL APPLICATIONS


DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By Nilgün Ferhatosmanoğlu, B.S.

The Ohio State University, 2007

Dissertation Committee:
Dr. Theodore T. Allen, Adviser
Dr. Clark A. Mount-Campbell
Dr. David D. Woods

Approved by: Graduate Program in Industrial and Systems Engineering

Copyright by Nilgün Ferhatosmanoğlu, 2007

ABSTRACT

This dissertation explores two types of applications of applied statistics techniques to develop methods associated with bioinformatics and information retrieval.

The first type relates to planning what is probably the most common type of genetics-related experiment, co-hybridized microarray testing. In these experiments, samples are paired and tested on specific slides using either red or green dyes. The question addressed is how to assign samples to slides and select dye colors in a manner that improves sensitivity and specificity without increasing the associated cost. A generalized A-optimality criterion called the expected squared errors of coefficient estimates (ESECE) is proposed to aid experimental design selection in this context. The proposed criterion can also be applied to any type of experimentation focused on parameter estimation. Heuristic methods to generate arrays using the proposed criterion are also suggested, and the resulting so-called hybrid designs are described. The hybrid designs are shown to constitute a compromise between the reference designs widely used by practitioners and the loop designs explored in the literature. The proposed criterion and a study of 15,488 genes together suggest that reference designs are generally likely to foster more accurate estimation than loop designs. The proposed hybrid designs likely offer further gains in sensitivity and specificity at no added cost.

The second type of application explored is the design of vector space search engines, which constitute perhaps the most common type of search engine. This research also relates to bioinformatics, in that the development of information retrieval and database techniques for specific challenges in biology is an active area of research. Contributions here focus on the selection of the weights used in the search engine's distance function. Documents are selected by calculating their weighted distance to the vector associated with the query. In this dissertation, two types of methods are explored separately and in combination in an attempt to tune the selection of weights so that the search engine generates results of greater interest to users. The first type is so-called discrete choice analysis (DCA) methods, which permit the estimation of weights that putatively maximize the expected utility of users in the context of specific queries. The second type is the application of mixture modeling, which is often associated with food science. Based on the fitting of specific types of mixture regression models, methods are proposed to enhance the expected user utility for a variety of queries. The DCA methods are illustrated using a news database and simulated users. The associated test problems indicate that the proposed methods could improve performance compared with the common strategy of applying equal weights to all semantic dimensions.

Dedicated to my parents

ACKNOWLEDGMENTS

Considering the many years of my educational journey, from primary school to a PhD, it is impossible for me to acknowledge the countless people who directly or indirectly supported me.

I first thank my advisor, Dr. Theodore T. Allen, for all of his invaluable supervision, time, and effort throughout my PhD education. He has always been a good advisor and a good mentor to me. Through his sincere help and his generous and productive critiques during stressful times of study, I learned how quality research is performed. I also appreciate the financial support provided by Edison Welding Institute.

I owe special thanks to my committee members, Dr. Clark A. Mount-Campbell and Dr. David D. Woods, who greatly helped me improve my research with their guiding advice and helpful questions during all stages of my candidacy and colloquium exams.

I feel that no thanks would be enough for my dear husband, Hakan Ferhatosmanoğlu. But in words, I thank him: first, for being a wonderful husband and a wonderful friend to me, to our dear son, Yavuz Fuat, and to our dear daughter, Eda Zeynep. I love them all. Second, I thank him for all the love and encouragement that supported me through my long and stressful educational experience towards my PhD.

I would like to give special thanks to my dear parents-in-law, Mürvet and Bekir Ferhatosmanoğlu, for being a wonderful couple, for making my marriage to my husband possible, and for sincerely supporting my educational ideals. I thank my dear brother, Mesut, and his dear wife, Cavide, for all their love and support.

Finally, I thank my dear parents, Yaşariye and Hayati Arslan, for all kinds of support, and especially for their invaluable efforts to build my enthusiasm for learning and education, which endures to this day. My dear mother and father, thank you very much for being such a wonderful family and for advising me, from the very beginning, to aim for the highest.

VITA

1980: Born, Sakarya, Turkey
2002: B.S. Industrial Engineering, Bilkent University, Ankara, Turkey
Graduate Research Assistant, Ohio State University, Columbus, Ohio
Present: Ohio State University, Columbus, Ohio

PUBLICATIONS

Research Publications

1. N. Chantarat, T. T. Allen, N. Ferhatosmanoglu, and M. Bernshteyn (2006), A Combined Array Approach to Minimize Expected Prediction Errors in Experimentation Involving Mixture and Process Variables, The International Journal of Industrial and Systems Engineering, Volume 1, Nos. 1/2.
2. N. Ferhatosmanoglu, T. Allen. Optimal Design of Experiments for cDNA Microarrays. INFORMS 2006 Annual Meeting, Pittsburgh, PA, Nov. 5-8.
3. N. Zheng, T. Allen, and N. Ferhatosmanoglu. Fast Optimal DOE for Search Engine Technology. INFORMS 2006 Annual Meeting, Pittsburgh, PA, Nov. 5-8.
4. N. Ferhatosmanoglu, T. T. Allen, and U. V. Catalyurek, Optimal Design of Experiments for cDNA Microarrays: A Novel Criterion for Comparing Loop and Reference Designs (submitted to Bioinformatics Journal, Sep 2006).

FIELDS OF STUDY

Major Field: Industrial and Systems Engineering
Minor Field: Biomedical Informatics
Minor Field: Human Factors and Cognitive Engineering

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapter 1 Introduction
Problem Statements Overview

Chapter 2 A Novel Experimental Planning Criterion For Co-Hybridized Microarrays
Introduction cDNA Microarray Experiments Basic Stages of Microarray Experimentation Sources of Variation in Microarray Experiments Experimental Constraints Graphical Representation of Experiments Experimental Design Options Dye-Swaps Reference Sample Designs Loop Designs Evaluation of Microarray Designs Models and Assumptions Design of Experiments for Microarrays Comparison of Loop versus Reference Designs The Minimum Variance Criteria and A-Optimality Bias and The Proposed Criterion (ESECE) Benefits of the ESECE Criterion (Generalized A-Optimality) Simulation Studies for Comparison of A-Optimality and ESECE Criteria Assumptions Simulation Set-ups Results and Discussion 2.7 Plant Leaf Case Study Experimental Assumptions Experimental Method Evaluation Results and Discussion Contribution of the Research to Biological Studies Experiments and Results Research Summary Conclusion and Future Work References

Chapter 3 Optimal Experimental Design For Co-hybridized Microarrays
Introduction Evaluation Criteria of Microarray Experiments Structure of Feasible Microarray Designs Number of Candidate Microarray Designs Enumerating All Designs Code to Optimize Using Exhaustive Enumeration A Heuristic Search Method Extension to Larger Studies and Hybrid Designs Conclusions and Future Work References

Chapter 4 Discrete Choice Models For User-Centric Search Engines
Introduction Proposed Methodology Assumptions and Key Ideas Method#1: Discrete Choice Analysis Weighting (DCAW) Method#2: Mixture Experiment Weight Optimization (MEWO) Online Weighting Updates Search Engine Method Evaluation Performance Measures Simulating Users Case Study News Database Discrete Choice Analysis Weighting (DCAW) Experiments Mixture Experiment Weight Optimization (MEWO) Online Update via Discrete Choice Analysis Conclusions and Future Work References

Chapter 5 Conclusions And Future Research
Overview Summary of Findings Limitations and Future Research

Appendix A
Appendix B
Appendix C
Appendix D
Bibliography

LIST OF TABLES

Table 1 Design Matrix format for T samples and n slides
Table 2 Comparison of Loop versus Reference Design based on the min variance and ESECE Criteria
Table 3 Comparison of Loop versus Reference Designs in Figure 9 based on the min variance and ESECE Criteria
Table 4 Comparison of the full design with eight slides and the sub-design with four slides
Table 5 True and fitted coefficient estimates for gene 6508 based on the comparison in Table
Table 6 True hit and false hit number and ratios for the sub-reference and sub-loop designs
Table 7 Number of Floating operations and the corresponding execution time for n samples and m slides
Table 8 Evaluation results for the designs in Figure
Table 9 Evaluation results for the designs in Figure
Table 10 Evaluation results for the designs in Figure
Table 11 Evaluation results for the designs in Figure
Table 12 Evaluation results for the designs in Figure
Table 13 Comparison of interwoven loop and hybrid designs
Table 14 Overview of proposed methods
Table 15 Design selection ids for the first 20 users in 3 sets
Table 16 Updated weights for 100 users in 3 sets
Table 17 The evaluation results for the updated weights
Table 18 Queries used for the search engine experiments
Table 19 True user weights, set
Table 20 True user weights, set
Table 21 True user weights, set
Table 22 First stage estimated weights (cont'd)

LIST OF FIGURES

Figure 1 An illustration of a cDNA microarray experiment
Figure 2 Basic Stages of a microarray experiment
Figure 3 Sources of variation during a microarray experiment
Figure 4 Graphical representation of a microarray design
Figure 5 A Dye-swap replication
Figure 6 An illustration of a reference sample design
Figure 7 An illustration of a loop design
Figure 8 (a) Study 1: Loop design with X1 (b) Study 2: Loop design with (X1 X2) (c) Study 3: Reference design with X1 (d) Study 4: Reference design with (X1 X2)
Figure 9 (a) Study 5: Loop design with X1 (b) Study 6: Loop design with (X1 X2)
Figure 10 (a) Study 9 and Study 10 based on the reference design with 4 samples and 12 arrays/slides, (b) Study 11 and Study 12 based on the loop design with 4 samples and 12 arrays/slides
Figure 11 Comparison of the loop and reference design in simulation studies 1-4 with various levels of model mis-specification or bias
Figure 12 Comparison of the loop and reference designs in simulation studies 5-8 with various levels of model mis-specification or bias
Figure 13 Comparison of the loop and reference designs in simulation studies 9-12 with various levels of model mis-specification or bias
Figure 14 The plant leaf gene expression data with the associated sample comparisons
Figure 15 (a) The reference designs and (b) the (incomplete) loop designs illustrated with the corresponding design matrices. Each design is a sub-design formed by comparing samples A, B, and C using 2, 3, 4, or 6 arrays (slides)
Figure 16 The comparison of the empirical squared error estimates with the predicted ESECE estimates for the sub-designs in Figure
Figure 17 Information loss versus cost trade-off function
Figure 18 (a) The 4th reference and (b) the 4th loop design from Figure
Figure 19 ROC curves based on the differentially expressed genes in the study
Figure 20 An infeasible microarray design violating the connectivity constraint
Figure 21 Illustration of the design matrices of isomorphic graphs
Figure 22 Optimal microarray designs from Shu and Wang,
Figure 23 Design comparisons for 5 samples and 12 arrays
Figure 24 Example hybrid designs generated using MBCO
Figure 25 Reference, interwoven and hybrid designs with 5 samples and 10 arrays
Figure 26 Reference, interwoven and hybrid designs with 5 samples and 12 arrays
Figure 27 Reference, interwoven and hybrid designs with 6 samples and 12 arrays
Figure 28 Reference, interwoven and hybrid designs with 6 samples and 14 arrays
Figure 29 Dye-swapped loop and hybrid designs with 7 samples and 16 arrays
Figure 30 Illustration of how weighting can affect effective semantic distances
Figure 31 Illustration of Key Idea
Figure 32 Illustration of Key Idea

CHAPTER 1

INTRODUCTION

1.1 Problem Statements

Many have referred to the contemporary era as the age of genetic discovery, and it seems possible that many of the diseases people currently suffer from can be cured or ameliorated through the study of genes. Further, it is estimated that billions of dollars are spent each year on genetic experimentation, much of it on a type of experimentation called co-hybridized microarray testing. In these experiments, samples are paired and assigned dye colors. It is widely known that the outputs of these experiments are noisy and that, as a result, the information derived can be unreliable. This is so despite attempts to build redundancy into specific microarray slides and much research on how to replicate experiments in a manner that balances the need for accuracy against cost constraints. A central question addressed in this dissertation is: How can better co-hybridized microarray experiments be planned, fostering demonstrably higher sensitivity

and specificity with no added cost? A related question is: How can the resulting recommended hybrid designs be derived efficiently for planning desirable co-hybridized experiments?

Another important advance of recent times is the search engine. It is becoming difficult to imagine how people previously lived without the benefit of the World Wide Web and search engines such as Google and Yahoo. While the algorithms used by the most popular search engines are proprietary, a survey of the research literature indicates that so-called vector space search engines continue to enjoy the most attention. Another question addressed by this dissertation is: Can a proof-of-concept verification be created showing that feedback from users can improve the ratings of results from vector space search engines? Associated with this question is the possibility of developing a deeper theoretical understanding of how feedback from users can be used seamlessly to improve search engines.

While the application topics addressed here encompass both genetic experimentation and search engines, the central theme is simple. This theme concerns the utility of concepts and techniques from the relatively mature area of applied statistics in the context of new fields. Specifically, we study two representative examples of these fields: i) planning co-hybridized microarray experimental designs in the biological sciences, and ii) setting the weights of the distance function for search engine design in the computer and information sciences. The methods explored from the applied statistics literature focus on design of experiments (DOE) techniques and a type of generalized linear modeling approach called discrete choice analysis (DCA).

1.2 Overview

The remainder of this dissertation is organized as follows.

Chapter 2 presents background information on co-hybridized microarray experiments, reviews the literature on evaluating microarray designs, and introduces a novel criterion, the expected squared errors of coefficient estimates (ESECE), for evaluating microarray designs under more realistic assumptions than the criteria in the previous literature. The performance of this criterion is compared with that of the other criteria through several simulated and real experiments.

Chapter 3 discusses the issues related to finding the optimal microarray design based on the criterion proposed in Chapter 2. The results of optimizing the ESECE criterion show that hybrid designs foster improved performance compared with the experimental plans suggested in the literature.

Chapter 4 provides an experimental design framework, based on two types of design of experiments methodologies, for choosing the settings of a tuned search engine designed to improve the relevance of search results.

Chapter 5 summarizes the findings, limitations, and conclusions of the studies in both applications and discusses potential areas for future research.

CHAPTER 2

A NOVEL EXPERIMENTAL PLANNING CRITERION FOR CO-HYBRIDIZED MICROARRAYS

2.1 Introduction

Microarray experiments are used to quantify and compare gene expression on a large scale in order to address complex scientific questions. Through gene expression profiling, microarrays provide deep insights to biologists who are interested in the molecular functionality of genes, including protein prediction, homology searching, and expression analysis (Callow et al., 2000; Van 't Veer et al., 2002). Consequently, there is increasing demand for statistical assessment of the conclusions drawn from microarray experiments. One drawback of the high-throughput data from microarray experiments is the significant amount of noise caused by variability and measurement errors during the experiment. Therefore, a search for development of

new computational and statistical tools to handle the great amount of complex multivariate microarray data is unavoidable in order to improve the efficiency and the reliability of the obtained data.

As with all large-scale experiments, microarrays can be costly in terms of equipment, consumables, and time. Therefore, given the expensive and limited resources, the experiments need to be designed properly to obtain results that are maximally informative. The questions of interest and the associated challenges, such as the sources of variability in the experimental setup, should be identified before the available resources are committed.

We focus on co-hybridized microarray experiments, in which a set of glass slides is used, each with two messenger RNA (mRNA) samples spotted over thousands of (usually) complementary DNA (cDNA) probes, leading to further analysis of the relative expression of the sample genes. The main issues that need to be addressed when planning such two-color microarray experiments include which samples of study (also called targets, varieties (Kerr et al., 2001), treatments, or conditions) to hybridize on the same slide, whether to compare the samples directly or indirectly, and which samples to replicate. These issues are studied in light of the limitations of microarray experiments, such as the number of microarray slides; the amount of biological input available from the samples, i.e., the mRNA whose complementary DNA is hybridized on the slides; and the aim of the study regarding the primary comparisons of interest (Yang and Speed, 2002).

The most widely used experimental design within the biological community is the so-called reference design. In this design, each condition of interest is compared with samples taken from some standard reference. In an alternative design, called the loop design, two conditions are compared through a chain of other conditions. Most theoretical papers on microarray design argue that the loop design is more efficient than the reference design (Vinciotti et al., 2005; Landgrebe et al., 2004; Churchill, 2002; Glonek and Solomon, 2004; Kerr and Churchill, 2001; Khanin and Wit, 2004; Yang and Speed, 2002). These studies establish theoretical advantages of loop designs over reference designs.

Examining the criteria on which this design evaluation research is built, we found that all the theoretical studies on microarray design assume there is no bias in their evaluation of the designs, because all assume that the model they are fitting is the true model. To overcome this common problem, we propose a novel efficiency criterion, the Expected Squared Error of Coefficient Estimates (ESECE), as an alternative to the most widely used A-Optimality criterion suggested by Kerr et al. (2001) and Vinciotti et al. (2005). The main difference between the two criteria is that A-Optimality assumes there is no bias in the fitted parameters of the model, i.e., the true model is assumed to have the same form as the fitted model, whereas the ESECE criterion evaluates the fitted model in the presence of bias. Therefore, ESECE gives an estimate of the quality of the fitted model based on both variance and bias. We propose the ESECE

criterion based on both simulated data and real microarray data; it reverses the relative strengths of the loop and reference designs once bias is incorporated into the evaluation of the efficiency of the designs.

2.2 cDNA Microarray Experiments

A complementary DNA (cDNA) microarray is a DNA chip or gene array that allows identification of gene expression levels in a biological sample. A sample is the biological unit used for the experiment, such as a tumor cell from a mouse or human. In a microarray experiment, one must compare two samples simultaneously. By comparing samples, we mean comparing the expression levels of genes coming from the cell populations of the two samples. In other words, a DNA chip is used to determine which genes are activated and which genes are repressed when two populations of cells are compared.

Some biologists are interested in learning which genes are activated during a certain type of disease. One goal of these studies is to target these activated genes to aid recovery from the disease. For example, genes from a tumor cell of a mouse are compared with the genes from a healthy mouse to see which genes are activated and which are repressed. If some of the genes from the tumor cell are differentially expressed relative to the genes from the healthy cells, they are selected as potential candidates for gene therapy.

Let us go over the process briefly. The first step in preparing for the experiment is determining which genes of the samples to compare. These genes are then fixed on the spots of the microarray by immobilization. There are a large number of


spots, around 10,000 to 40,000 depending on the type of experiment. Ready-made microarrays are available for different types of experiments.

The source of the biological input used for the experiment is the pure mRNA extracted from the samples mentioned above. After the pure mRNA is extracted from the target cells of the samples, it is reverse transcribed to complementary DNA, which can base pair with the DNA clones (genes) immobilized on the spots of the microarray. To identify which sample each cDNA comes from, the cDNA from each of the two cells is labeled with one of two different dyes, green (Cy3) or red (Cy5). After this labeling process, the colored cDNA strands are pooled together and poured onto the microarray. Some of the colored cDNA binds to the spots, i.e., hybridizes with its complementary strand. The unhybridized cDNA is then washed off. In order to detect which DNA is bound to the spots, the microarray is exposed once to the green laser and once to the red laser. After that, the scanned red and green images are merged.

Figure 1 An illustration of a cDNA microarray experiment

The result is a colorful scene of spots with different combinations of red and green signal intensity (see Figure 1, taken from Blalock, 2003). A spot containing more red means that the genes from the tumor cells are activated more, i.e., expressed differentially, or vice versa. Therefore, those are the target genes of study.

2.2.1 Basic Stages of Microarray Experimentation

Microarray experiments involve several pre- and post-experimental stages. Figure 2 (from …) illustrates these fundamental stages. The pre-experimental stages involve:

1. Planning an experimental design based on the samples to be used for the experiment.
2. Tethering candidate DNA sequences or genes on all the spots of the microarray.
3. Extraction of mRNA samples from the cells of the samples of study.
4. Reverse transcription of the extracted mRNAs to complementary DNA sequences.
5. Labeling of the cDNA sequences with one of the green or red fluorescent dyes.
6. Pooling of the labeled cDNAs all together.
7. Hybridization of the cDNA sequences to the immobilized DNAs on the spots of the microarray.

Figure 2 Basic Stages of a microarray experiment

After these stages are accomplished, the unhybridized cDNA is washed off, and, to examine the abundance of the cDNA sequences on the spots, the microarray is heated and exposed to green and red laser scanning as shown in Figure 2. The same process is then repeated for every sample comparison specified in the experimental design of the microarray experiment.

Once all the experiments and their replicates have been conducted, investigators turn to the data analysis stages. The differential green and red signal intensities shown in Figure 2 are quantified for each spot (or gene), and the data then pass through several normalization techniques, which are the subject of post-experimental analysis of microarray experiments. After all these stages, the data are finally ready for several statistical tests

identifying which genes are expressed differentially among the samples. The results of these statistical tests highlight the scientific discoveries of the biologists and the pharmacists.

2.2.2 Sources of Variation in Microarray Experiments

As noted in Section 2.2.1, there are many sources of variation that affect the differential expression of the genes of the samples of study. Since microarray experiments have multiple sources of variation, experimental plans should ensure that effects of interest are not confounded with ancillary effects. The known or estimated sources of variation in the experimental setup need to be carefully considered and managed for the effective design and analysis of microarray experiments.

The two-dye system, as opposed to a single fluorescent dye, has been widely used in microarray experiments. The relative red and green intensities from a spot are used to generate the data to be analyzed further. As seen in Figure 3 (taken from Churchill, 2002), three types of sources of variation arise across the layers of the experiment. In the first layer, the experiment is exposed to biological variation under the influence of genetic or environmental factors, which are native to all organisms. Using biological replicates addresses this type of variation. The second layer depicts technical variation introduced during the extraction, labeling, and hybridization of the samples. This variation can be addressed by duplicating the extraction of the mRNA from the samples in an independent environment and using both of the dyes for each replicate. In the

bottom layer there is the potential for measurement error in reading the signals, which can be affected by dust on the array and similar factors. Duplicating the spots might help to reduce this type of error.

Figure 3 Sources of variation during a microarray experiment

Valid statistical tests of microarray experiments are usually based on the biological and technical variation. Effective designs are expected to include replication at these two levels of variation. The third type of error, measurement error, is typically introduced while identifying and quantifying the dye signals, which is a sensitive process (Churchill, 2002).
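The effect of replicating at each of the three layers can be illustrated with a small calculation. The sketch below assumes a balanced, nested scheme with independent biological, technical, and measurement variance components; that model, the function name `variance_of_mean`, and the numeric variances are illustrative assumptions, not quantities taken from this study.

```python
def variance_of_mean(var_bio, var_tech, var_meas, n_bio, n_tech, n_spot):
    """Variance of the average log ratio under an assumed nested, balanced scheme.

    n_bio  : biological replicates
    n_tech : technical replicates per biological replicate
    n_spot : duplicate spots per technical replicate
    """
    return (var_bio / n_bio
            + var_tech / (n_bio * n_tech)
            + var_meas / (n_bio * n_tech * n_spot))


# Hypothetical variance components, chosen only for illustration.
VAR_BIO, VAR_TECH, VAR_MEAS = 1.0, 0.5, 0.25

plans = {
    "no replication": (1, 1, 1),
    "duplicate spots": (1, 1, 2),
    "two technical replicates": (1, 2, 1),
    "two biological replicates": (2, 1, 1),
}
for name, (n_bio, n_tech, n_spot) in plans.items():
    value = variance_of_mean(VAR_BIO, VAR_TECH, VAR_MEAS, n_bio, n_tech, n_spot)
    print(f"{name}: {value}")
```

Under this assumed model, extra spots shrink only the measurement term, extra technical replicates shrink the technical and measurement terms, and extra biological replicates shrink all three.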

2.2.3 Experimental Constraints

Yang et al. (2002) suggest various scientific and physical constraints that a microarray experiment should satisfy. Below is an itemized list of these constraints:

- aim of the experiment: identify differentially expressed genes, search for specific gene-expression patterns, identify tumor subclasses
- amount of mRNA available: affects the number of replicate slides possible
- number of slides available: limits the number of hybridizations made
- experimental process before hybridization
- controls planned: positive, negative, ratio, and others
- verification method

The scientific constraints include the aim of the experiment and the prioritization of the specific questions to be answered. The physical constraints are mainly the amount of mRNA available for the experiment, which limits the number of replications, and the number of slides, which limits the number of hybridizations that can be made. The best design for one experimental aim may be inefficient for an experiment with a different aim. Therefore, it is important to pay attention to the various constraints of an experiment (see Churchill, 2002). When conducting statistical experiments in these contexts, the most important constraints to consider are the amount of mRNA available and the number of slides used for the experiment.

2.2.4 Graphical Representation of Experiments

Microarray experiments can be represented graphically by multidigraphs or directed graphs, in which the nodes or vertices represent the target mRNA samples used in the study and the arrows or edges correspond to the arrays or slides used for comparing the associated samples. These digraphs efficiently convey all the information about a microarray experiment and are helpful in the statistical analysis of microarray data. Figure 4 shows the graphical representation of a microarray experiment using 3 samples and 3 slides, together with the corresponding design matrix.

Figure 4 Graphical representation of a microarray design

In the figure, each arrow in the graph represents one array experiment, and the direction of the arrow corresponds to the dye labeling. The head of the arrow represents a green label and the tail of the arrow represents a red label. The nodes joined by the arrow are the samples that are compared on the array. Accordingly, a -1 in the

design matrix corresponds to a green label and a 1 represents a red label. The samples that have a 0 are the ones that are not included in that particular experiment. The constraints on the design matrix of a microarray experiment are:

1. Each row in the design matrix should have exactly one 1 and one -1, with all remaining entries 0, because only two samples can be compared on one array. This constraint makes the graph directed.
2. The graph of the microarray experiment should be connected, which means that each sample should be compared with at least one other sample and that every pair of samples is linked through some chain of comparisons. This is obligatory since the experimenter is interested in all the sample comparisons, though some of them may be of higher interest.

Therefore, microarray designs graphically produce directed and connected graphs.

2.3 Experimental Design Options

For a given number of samples and slides, there is an exponential number of design options for a microarray experiment that satisfy the constraints described above for a feasible directed and connected graph. (For a detailed discussion of this subject, refer to Chapter 3 of the dissertation.) The most popular design options are discussed in the sections below.
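The two feasibility constraints on a design matrix from Section 2.2.4 can be sketched as a short check. The function name `is_feasible_design` and the list-of-rows representation are illustrative assumptions; the sign convention (1 for red, -1 for green, 0 for samples not on the slide) follows the text.

```python
def is_feasible_design(design):
    """Check the row constraint and the connectivity constraint.

    design: list of rows (one per slide), each a list with one entry per sample.
    """
    if not design:
        return False
    n_samples = len(design[0])
    adjacency = {sample: set() for sample in range(n_samples)}

    for row in design:
        if len(row) != n_samples:
            return False
        # Constraint 1: exactly one 1 (red) and one -1 (green), zeros elsewhere.
        if sorted(value for value in row if value != 0) != [-1, 1]:
            return False
        red, green = row.index(1), row.index(-1)
        adjacency[red].add(green)
        adjacency[green].add(red)

    # Constraint 2: the graph is connected (arrow direction is ignored here).
    seen, stack = {0}, [0]
    while stack:
        node = stack.pop()
        for neighbor in adjacency[node] - seen:
            seen.add(neighbor)
            stack.append(neighbor)
    return len(seen) == n_samples


# A 3-sample, 3-slide loop, and two hypothetical infeasible designs.
loop_design = [[1, -1, 0], [0, 1, -1], [-1, 0, 1]]
disconnected = [[1, -1, 0, 0], [-1, 1, 0, 0]]  # samples 3 and 4 never used
two_reds = [[1, 1, 0]]                          # violates the row constraint

print(is_feasible_design(loop_design))
print(is_feasible_design(disconnected))
print(is_feasible_design(two_reds))
```

A check of this kind only filters candidate designs; it says nothing about which feasible design is most efficient.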

2.3.1 Dye-Swaps

Figure 5 A dye-swap replication (samples A and B)

In dye-swap replications, each hybridization between two samples is performed twice with the dye labels reversed (see Figure 5). Most cDNA microarray experiments show systematic differences in the red and green intensities, which require correction at the normalization step (Yang and Speed, 2002). Dye-swap experiments play a significant role in reducing the residual color bias, whereas replication with the same labeling may cause the color bias to accumulate. When indirect comparisons are used, such as in reference designs, the residual color bias is removed as in the following equation proposed by Yang and Speed (2002):

$$[\log(A/R) + \text{residual color bias}] - [\log(B/R) + \text{residual color bias}] = \log(A/R) - \log(B/R)$$

Reference Sample Designs

The reference design, the most widely used experimental design, is formed by comparing each sample to a standard reference, with the reference always given the same dye label. As illustrated in Figure 6, samples B and C are hybridized with A consecutively, with A always labeled with the green dye.

Figure 6 An illustration of a reference sample design (samples A, B, and C)

The reference design has both disadvantages and advantages. One disadvantage is the possibility of a dye effect being confounded with treatment effects. Making dye-swaps between the reference and the samples can address this issue by reducing the technical variation. Advantages of these designs include the single step needed to add any other sample to the experiment, and the fact that there are only two pathways for comparing every combination of two experimental samples.

Loop Designs

Figure 7 An illustration of a loop design (samples A, B, and C)

Another popular design in biological studies is the loop design. In loop designs, the samples are compared in consecutive pairs, with both the green and the red dye labeling, in a chain structure (see Figure 7 for an illustration of a loop design). The advantages of loop designs are that the dye effects are not confounded with the treatment effects as in the reference design, and that more direct comparisons can be made for a given number of slides. For a specific number of samples, Kerr et al. prove that loop designs are more efficient than reference designs based on A-Optimality. One disadvantage of the loop design is that its performance depends on the number of samples: large loops with long indirect paths between samples may produce inefficient and imprecise estimates of the intensity log ratios of the hybridized samples. Another disadvantage is the difficulty of extending an experiment by adding another sample.

2.4 Evaluation of Microarray Designs

Kerr et al. (2001) introduced the first statistical model for gene expression in microarrays. Since then, research has evolved on statistically evaluating microarray designs and on finding the best design for a given number of samples and slides.

Models and Assumptions

Kerr and Churchill (2001) proposed the first statistical model for gene expression in microarrays. Specifically, Kerr et al. suggested the following Analysis of Variance (ANOVA) model, including the factor main effects and the interaction effects of array-gene (spot effects), dye-gene, and variety-gene:

$$z_{ijkl} = \mu + \alpha_i + \kappa_j + \tau_k + \gamma_l + (\alpha\gamma)_{il} + (\kappa\gamma)_{jl} + (\tau\gamma)_{kl} + \eta_{ijkl} \tag{1}$$

where $z_{ijkl}$ is the fluorescent intensity, in the log scale, from array run $i$ and dye $j$ representing variety $k$ and gene $l$, and $\mu$ = mean effect, $\alpha$ = array effect, $\kappa$ = dye effect, $\tau$ = variety effect, $\gamma$ = gene effect, and $\eta$ = random error. With regard to experimental planning, the above model could be fitted if the same set of genes is spotted on each array in an expensive experiment that includes all combinations of samples, dyes, and pairings, replicated more than once. More recently, research has moved toward models that emphasize the comparison of expression of different samples and toward models with fewer terms, to permit more economical testing.

Design of Experiments for Microarrays

Based on the ANOVA model in (1), Landgrebe et al. (2006) developed a linear model for the difference of the log signal intensities, $z_{ijkl}$, from the two samples on each experiment for each gene:

$$y = X_1 \beta_1 + \varepsilon \tag{2}$$

where $y_i = z_{irk'} - z_{igk}$ for $i = 1, \dots, n$, and $\varepsilon \sim N(0, \sigma^2 I_n)$, and where $X_1$ is the design matrix deriving from the experimental plan and model and $\beta_1$ is the parameter vector including the dye and variety effects. Note that the gene notation from the model in (1) has been dropped in (2), and that $g$ and $r$ in $z_{ijk}$ ($j = g, r$) stand for the green and red dyes corresponding to the samples $k$ and $k'$ compared on array $i$.

We follow the model in Vinciotti et al. (2005), in the same format as (2), taking $\beta_1$ to be the hypothetical expression difference between conditions and leaving dye effects out of $X_1$, as in Table 1. In Table 1, the entry -1 indicates that the corresponding sample is labeled with the green dye, and a 1 indicates that the corresponding sample is labeled with the red dye. The design matrix built for $T$ samples has to include $T - 1$ columns to be non-singular, since every row is composed of one 1, one -1, and zeros elsewhere. The remaining column is known.

Table 1 Design matrix format for $T$ samples and $n$ slides: one row per slide ($1, \dots, n$), one column per sample effect ($\tau_1, \tau_2, \tau_3, \dots, \tau_T$), and entries equal to 1, -1, or 0.

Once the design matrix has been built for a design, statistical calculations can be run to evaluate so-called criteria for different designs. In the following sections, the A-Optimality and ESECE criteria are discussed in detail and used to compare the efficiency of microarray designs.
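The requirement of $T - 1$ columns can be illustrated numerically. The sketch below assumes NumPy and a hypothetical 3-sample loop with one column per sample. Because every row sums to zero, the $T$ columns are linearly dependent, and dropping one column (here sample A, an assumed choice of baseline) leaves a matrix of full column rank. The helper name `drop_reference` is illustrative only.

```python
import numpy as np

# Hypothetical 3-sample loop: columns are samples A, B, C; rows are slides.
# 1 = red label, -1 = green label, 0 = sample not on the slide.
full = np.array([[-1, 1, 0],
                 [0, -1, 1],
                 [1, 0, -1]], dtype=float)


def drop_reference(full_matrix, ref_col=0):
    """Remove the baseline sample's column, leaving T - 1 columns."""
    return np.delete(full_matrix, ref_col, axis=1)


x1 = drop_reference(full)
rank_full = np.linalg.matrix_rank(full)   # rows sum to zero, so rank < T
rank_x1 = np.linalg.matrix_rank(x1)       # equals the number of columns
print(rank_full, rank_x1, x1.shape[1])
```

Under this assumed parameterization the two remaining columns play the role of the contrasts of samples B and C against sample A.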

2.5 Comparison of Loop versus Reference Designs

In this section, we discuss the criteria most commonly used in the literature, together with the ESECE criterion as an alternative, for comparing the most popular microarray designs, i.e., reference sample and loop designs. After discussing these criteria, we provide the results of 12 simulation studies conducted on various microarray designs with different numbers of samples and slides.

The Minimum Variance Criteria and A-Optimality

We refer to Yang and Speed (2002) and Vinciotti et al. (2005) for variance-based evaluation of a given microarray design. For $k$ = A, B, and C in Figure 7, let $\theta_{ijk} = E(z_{ijk})$ and $\beta_{kk'} = \theta_{irk'} - \theta_{igk}$ for array $i$ (= 1, 2, 3). In order to estimate the parameters in the vector $\beta_1$, the model in (2) can be fitted, where $X_1$ is the design matrix of the loop design in Figure 7 and

$$\beta_1 = (\beta_{AB}, \beta_{AC})'. \tag{3}$$

As usual, the least-squares estimates and the corresponding variance-covariance matrix of the estimates are:

$$\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'y \quad \text{and} \quad \sigma^2 (X_1'X_1)^{-1} \tag{4}$$

where $X_1'X_1$ is often called the information matrix. For a given design, this matrix is commonly used to evaluate experimental design matrices before experimentation begins.
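The least-squares estimate in equation (4) and the trace of the variance-covariance matrix can be sketched as follows. This is only an illustration assuming NumPy. The 3-slide loop matrix, with columns for $\beta_{AB}$ and $\beta_{AC}$, and the coefficient values are assumptions chosen for the example, not the entries of the matrix in equation (3).

```python
import numpy as np


def least_squares(x1, y):
    """Equation (4): beta_hat = (X1'X1)^-1 X1'y."""
    return np.linalg.solve(x1.T @ x1, x1.T @ y)


def a_optimality(x1, sigma2=1.0):
    """Trace of the variance-covariance matrix, sigma^2 Tr[(X1'X1)^-1]."""
    return sigma2 * np.trace(np.linalg.inv(x1.T @ x1))


# Assumed 3-slide loop; columns are beta_AB and beta_AC.
x1_loop = np.array([[1, 0], [-1, 1], [0, -1]], dtype=float)
beta_true = np.array([0.5, -1.0])   # made-up log-ratio contrasts
y = x1_loop @ beta_true             # noise-free responses for the check
beta_hat = least_squares(x1_loop, y)
print(beta_hat, a_optimality(x1_loop))
```

With noise-free responses the estimate reproduces the assumed coefficients exactly, which is a quick sanity check that the design matrix has full column rank.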

Yang and Speed (2002), Kerr and Churchill (2001), and Vinciotti et al. (2005) all investigated criteria based on the assumption that the fitted model forms are the true model forms. Specifically, they focused on evaluating alternative approaches using the A-Optimality criterion. This criterion is the average variance of the parameters, which can be written as:

$$\sigma^2 \,\mathrm{Tr}[(X_1'X_1)^{-1}] \tag{5}$$

where Tr is the trace, i.e., the sum of the diagonal values. Vinciotti et al. (2005) also explained the need to ensure that the same parameterization is used in modeling alternative designs, which we also assume.

Bias and The Proposed Criterion (ESECE)

Again using $n$ as the number of slides or experimental runs of the experimental plan, the experimental plan $D$ is an $n \times (T - 1)$ array. Assume that the log ratio signal intensities from the experiments, $y$, derive from the following model:

$$y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon \tag{6}$$

where $X_1$ is the $n \times (T-1)$ design matrix and $\beta_1$ is a vector containing the true values of the fitted coefficients. Also, $X_2$ is an $n \times (T-1)$ design matrix composed of the interaction contrasts (e.g., see Fig. 2 (b)), $\beta_2$ is the corresponding vector of additional terms in the hypothetical true model, and $\varepsilon$ is an $n$-vector of experimental random errors with standard deviation $\sigma$. Yet, because of limitations on the number of slides $n$, the model fitted to estimate the differential expression parameters of a given gene is still:

$$\hat{y} = X_1 \hat{\beta}_1 \tag{7}$$

where $\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'y$. The assumption scheme in equations (6) and (7) is more realistic than the scheme in equations (2) and (4) because it includes the possibility of lack of fit, i.e., important interactions, $X_2$, that are in the true model (6) but not in the fitted model (7).

Allen, Yu, and Schmitz (2003) proposed the Expected Integrated Mean Square Error (EIMSE) criterion, which minimizes the expected mean square error and focuses on the prediction of responses at novel conditions. Allen, Bernshteyn, Yu, and Kabiri (2003) show that EIMSE optimal designs produce relatively low prediction errors in the context of computer experiment case studies from the literature. The EIMSE has already been used to achieve useful engineering results, as described in Allen, Yu, and Schmitz (2003) and Allen, Yu, and Bernshteyn (2000). Yet, in modeling based on microarrays, prediction is of interest only for discrete gene types, and accurate estimation of the mean expression differences, i.e., the $\beta_1$ model parameters, is of primary interest.

To achieve the best results from a microarray experiment, the $\beta_1$ estimates or the contrast estimates should be accurate, i.e., the total errors (variance plus bias) of the $\beta_1$ estimates should be minimized. Accurate estimation of the differences in $\beta_1$ can reasonably be expected to reduce the chance of so-called false positives and negatives. These occur when errors in estimating the sample effects lead to incorrect conclusions about whether genes are differentially expressed between different samples.

The proposed Expected Squared Error of Coefficient Estimates (ESECE) criterion is the sum of the expected squared errors associated with estimation for a given design, $D$. The proposed criterion is:

$$\mathrm{ESECE}(D) = E_{\beta_1,\beta_2,\varepsilon}\left[(\beta_1 - \hat{\beta}_1)'(\beta_1 - \hat{\beta}_1)\right] \tag{8}$$

where $\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'y$, $y = X_1\beta_1 + X_2\beta_2 + \varepsilon$, $\beta_2 \sim MN[\mu_2, B_2]$, and $\varepsilon \sim N[0, I\sigma^2]$. $B_2$ is the prior covariance matrix of the missing coefficients (e.g., see Allen, Yu, and Schmitz, 2003) and MN refers to the multivariate normal distribution. Note that assumptions about $\beta_1$ are not needed because $\beta_1$ cancels in the estimation error $\beta_1 - \hat{\beta}_1$ in equation (8), independent of assumptions about $\beta_2$. The following formula permits computationally efficient evaluation under the often relevant assumption that $\mu_2 = 0$:

$$\mathrm{ESECE}(D) = \mathrm{Tr}\left[(X_1'X_1)^{-1}(X_1'X_2)\,B_2\,(X_1'X_2)'(X_1'X_1)^{-1}\right] + \sigma^2\,\mathrm{Tr}\left[(X_1'X_1)^{-1}\right] \tag{9}$$
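Equation (9) translates almost directly into a few lines of linear algebra. The following is a minimal sketch assuming NumPy. The function name `esece`, the 3-slide loop matrix, and the single omitted column `x2` are hypothetical choices for illustration and are not the interaction contrasts used in the dissertation's figures.

```python
import numpy as np


def esece(x1, x2, b2, sigma2=1.0):
    """Equation (9), valid under the assumption mu_2 = 0."""
    xtx_inv = np.linalg.inv(x1.T @ x1)
    alias = xtx_inv @ x1.T @ x2              # (X1'X1)^-1 X1'X2
    bias = np.trace(alias @ b2 @ alias.T)    # first term of equation (9)
    variance = sigma2 * np.trace(xtx_inv)    # second term, as in equation (5)
    return bias + variance


x1 = np.array([[1, 0], [-1, 1], [0, -1]], dtype=float)  # assumed loop design
x2 = np.array([[1], [0], [0]], dtype=float)             # hypothetical omitted term
no_bias = esece(x1, x2, np.zeros((1, 1)))   # B2 = 0 gives the variance term only
with_bias = esece(x1, x2, np.eye(1))        # B2 = I adds the bias term
print(no_bias, with_bias)
```

Setting `b2` to a zero matrix removes the first term, so the function returns the quantity in equation (5).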

The assumption $\mu_2 = 0$ is generally reasonable since these are coefficients that one is not fitting, that one hopes are near zero in magnitude, and that are no more likely to be positive than negative. The formula based on $\mu_2 \neq 0$ is more complicated and a topic for future study. Also, it might be of interest to quote ESECE($D$) divided by the number of total model terms to give the expected average squared error.

Also, the ESECE can be called generalized A-optimality or generalized A-efficiency because the second part of equation (9), $\sigma^2\,\mathrm{Tr}[(X_1'X_1)^{-1}]$, is A-optimality. The first part represents the bias between the assumptions of the fitted and the true model. If one makes the optimistic assumption $B_2 = 0$, the ESECE reduces to A-optimality.

Below we give a derivation of Equation (9):

$$\min\; E_{\beta_1,\beta_2,\varepsilon}\left[(\beta_1 - \hat{\beta}_1)'(\beta_1 - \hat{\beta}_1)\right] = \min\; E_{\beta_1,\beta_2,\varepsilon}\left\{[\beta_1 - (X_1'X_1)^{-1}X_1'y]'\,[\beta_1 - (X_1'X_1)^{-1}X_1'y]\right\}$$

When we plug in $y = X_1\beta_1 + X_2\beta_2 + \varepsilon$, we get the following equivalence:

$$\min\; E_{\beta_1,\beta_2,\varepsilon}\left[(\beta_1 - \hat{\beta}_1)'(\beta_1 - \hat{\beta}_1)\right] = \min\; E_{\beta_1,\beta_2,\varepsilon}[A'A]$$

where

$$A = \beta_1 - \hat{\beta}_1 = \beta_1 - \left[\beta_1 + (X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'\varepsilon\right]$$

so that

$$A = -\left[(X_1'X_1)^{-1}X_1'X_2\beta_2 + (X_1'X_1)^{-1}X_1'\varepsilon\right].$$

Further, assuming $\beta_2 \sim MN[0, B_2]$ and $\varepsilon \sim MN[0, I\sigma^2]$, then $A \sim MN[0, \mathrm{var}(A)]$, where the independence of $\beta_2$ and $\varepsilon$ and the zero mean assumption are used. Under these assumptions, one has:

$$\mathrm{var}(A) = (X_1'X_1)^{-1}(X_1'X_2)\,B_2\,(X_1'X_2)'(X_1'X_1)^{-1} + (X_1'X_1)^{-1}X_1'\,(I\sigma^2)\,X_1(X_1'X_1)^{-1}.$$

Since $E[K'TK] = \mathrm{trace}[T\,\mathrm{var}(K)] + E[K]'\,T\,E[K]$ for any random vector $K$ and any symmetric matrix $T$, the following equivalence is obtained:

$$\min\; E_{\beta_1,\beta_2,\varepsilon}[A'A] = \min\left\{\mathrm{trace}[I\,\mathrm{var}(A)] + E[A]'\,I\,E[A]\right\}$$

Since $E[A] = 0$, the second term vanishes and can be eliminated from the objective function. Inserting the expression for $\mathrm{var}(A)$, we get the following objective:

$$= \min\left\{\mathrm{trace}\left[(X_1'X_1)^{-1}(X_1'X_2)\,B_2\,(X_1'X_2)'(X_1'X_1)^{-1}\right] + \mathrm{trace}\left[(X_1'X_1)^{-1}X_1'\,(I\sigma^2)\,X_1(X_1'X_1)^{-1}\right]\right\}$$

$$= \min\left\{\mathrm{trace}\left[(X_1'X_1)^{-1}(X_1'X_2)\,B_2\,(X_1'X_2)'(X_1'X_1)^{-1}\right] + \sigma^2\,\mathrm{trace}\left[(X_1'X_1)^{-1}\right]\right\}.$$

As a result, we have obtained a deterministic optimization formula as a function of $B_2$, which is the covariance matrix of the vector $\beta_2$.
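The closed-form result can be checked against a direct simulation of the expectation in equation (8). The sketch below assumes NumPy and uses the same hypothetical loop design and omitted column as an illustration. The names `esece` and `simulate_esece`, the number of draws, and the seed are arbitrary choices. Because $\beta_1$ cancels in the estimation error, it is set to zero in the simulation.

```python
import numpy as np


def esece(x1, x2, b2, sigma2=1.0):
    """Closed form of equation (9) under mu_2 = 0."""
    xtx_inv = np.linalg.inv(x1.T @ x1)
    alias = xtx_inv @ x1.T @ x2
    return np.trace(alias @ b2 @ alias.T) + sigma2 * np.trace(xtx_inv)


def simulate_esece(x1, x2, b2, sigma2=1.0, draws=200_000, seed=0):
    """Average squared estimation error over random beta_2 and epsilon."""
    rng = np.random.default_rng(seed)
    n = x1.shape[0]
    beta2 = rng.multivariate_normal(np.zeros(b2.shape[0]), b2, size=draws)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=(draws, n))
    # beta_1 is taken as zero, so y = X2 beta_2 + eps for every draw.
    y = beta2 @ x2.T + eps
    # Each row of beta1_hat is one least-squares estimate (X1'X1)^-1 X1'y.
    beta1_hat = y @ x1 @ np.linalg.inv(x1.T @ x1)
    return np.mean(np.sum(beta1_hat ** 2, axis=1))


x1 = np.array([[1, 0], [-1, 1], [0, -1]], dtype=float)  # assumed loop design
x2 = np.array([[1], [0], [0]], dtype=float)             # hypothetical omitted term
b2 = np.eye(1)
analytic = esece(x1, x2, b2)
simulated = simulate_esece(x1, x2, b2)
print(analytic, simulated)
```

The two numbers should agree up to Monte Carlo error, which shrinks as `draws` grows.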

2.5.3 Benefits of the ESECE Criterion (Generalized A-Optimality)

Benefits of this criterion that make it more intuitive and useful for microarray experimental planners include:

• ESECE derives from relatively realistic assumptions and thus likely aids in reducing the identification errors the user achieves.

• Like A-optimality, the ESECE is computationally easy to calculate, i.e., it does not require Monte Carlo simulation, which might be needed by more direct measures of false positive and negative error rate criteria.

• By including A-optimality as the special case with $B_2 = 0$, the ESECE is associated with the large body of literature on A-optimality.

The main limitation of the ESECE is the indeterminacy associated with the need to pick a specific prior covariance matrix, $B_2$, to assume. This issue is shared with the EIMSE criterion from Allen, Yu, and Schmitz (2003). A central conclusion from that research is that, while results do depend on the specific assumptions used, arbitrarily setting $B_2 = 0$ (i.e., using A-optimality in the present case) is generally not a desirable choice. In our experiments, we picked $B_2 = I$ so that the missing coefficients are comparable in size to the individual random errors, $\varepsilon_i$. The $B_2 = I$ assumption is consistent with assumptions used by Allen, Yu, and Schmitz (2003). It is also consistent with (but not restricted to) generating the missing coefficients as independent, identically distributed variables with standard deviation equal to 1.

In some cases, the experimenter might be interested in particular estimates of the contrasts for particular genes that are expected to be differentially expressed, and has nontrivial information from prior experiments. Since the general ESECE definition in equation (8) does not depend on any assumptions about $\beta_1$, it can be used to estimate the differential expression of any genes. In general, prohibitively large numbers of test runs would be needed to directly estimate both $\beta_1$ and $\beta_2$. Yet, any estimates for the distribution of $\beta_2$ can be plugged into equation (8) and a new formula derived for selecting which experiments to perform.

2.6 Simulation Studies for Comparison of A-Optimality and ESECE Criteria

In this section, the ESECE criterion and its special case, A-optimality, are used to evaluate alternative designs in 12 evaluation studies that effectively simulate the fitting of the linear model form $\hat{y} = X_1\hat{\beta}_1$ to simulated microarray data.

Assumptions

Since an analytical formula is available in equation (9), numerical simulation is not needed to estimate the average squared errors of estimation. Using the formula, one can effectively simulate the microarray log intensity data in two sets. The first set does not consider bias, similar to the minimum variance criteria; the second set considers the bias on which the proposed ESECE criterion is built. In other words, for the first set of the evaluation study, the microarray data is effectively simulated based on the same model form as the fitted model, i.e., $y = X_1\beta_1 + \varepsilon$, for both the loop and the reference designs. For the second set of two evaluations, we generated the microarray data assuming it had the model form $y = X_1\beta_1 + X_2\beta_2 + \varepsilon$, which has more terms than the fitted model, with the associated prior covariance matrix equal to the identity, i.e., $B_2 = I$.

Simulation Set-ups

The first four simulation studies are based on 4 designs, which are given with the corresponding design matrices in Figure 8. In Simulation Studies 1 and 3, the linear regression model $y = X_1\beta_1 + \varepsilon$ is fit for the loop design and the reference design, assuming there is no bias, meaning that the assumed true model is the same as the fitted model. For Simulation Studies 2 and 4, however, we added bias by adding two interaction terms to the assumed true model, and fit the same regression model as above.

Figure 8 shows the four experimental designs being evaluated. The figure also shows tables that are effectively design matrices. For example, after removing the column with the heading "run" and the header row, Figure 8 (a) is a type of loop design and the associated $X_1$ matrix, i.e., with no interaction columns included. Figure 8 (b) shows the same experimental plan with interaction terms included in the associated design matrices, which can be written using the concatenation $(X_1 \,|\, X_2)$.

Figure 8 (a) Study 1: Loop design with $X_1$ (b) Study 2: Loop design with $(X_1 \,|\, X_2)$ (c) Study 3: Reference design with $X_1$ (d) Study 4: Reference design with $(X_1 \,|\, X_2)$

When interaction terms are included in the assumed true model, the log-ratio intensity level is affected by the combined effects of the genes with both the samples and the dyes. This modeling captures both the relative dye-gene effects and the sample-gene effects (Kerr et al., 2001) across the two varieties.

Figure 9 (a) Study 5: Loop design with $X_1$ (b) Study 6: Loop design with $(X_1 \,|\, X_2)$ (c) Study 7: Reference design with $X_1$ (d) Study 8: Reference design with $(X_1 \,|\, X_2)$

In general, the $X_2$ matrix can be generated automatically for any experimental design, including more complicated reference and loop designs. Using this new parameterization, the additional effect of adding more terms to the assumed true model of the log-ratio intensities across two samples is modeled with additional design matrix columns, $X_2$. In this matrix, a 1 is assigned if the samples are hybridized on the same slide, and a 0 otherwise (see Figure 8 (a) and 8 (b)).

The evaluation results for the first four designs and the corresponding design matrices are given in Table 2 and Figure 11. In all studies, the linear regression model $\hat{y}(x) = f_1(x)'\hat{\beta}_1$ was fit for the loop design and the reference design. Figure 9 shows the other four simulation studies, conducted in the same fashion for the reference and loop designs, this time with more slides. The results are shown in Table 3 and Figure 12.

Figure 10 (a) Study 9 and Study 10 based on the reference design with 4 samples and 12 arrays/slides, (b) Study 11 and Study 12 based on the loop design with 4 samples and 12 arrays/slides.

Four more simulation studies were conducted based on the designs in Figure 10, where both the number of samples and the number of arrays are increased. Evaluation results for these simulations share a similar pattern with the previous studies and are given in Figure 13.

Results and Discussion

Table 2 shows the evaluation results for the first four simulation studies. Based on the results, it is observed that, with no bias included, the loop design has a lower ESECE than the reference design. This finding echoes the results in Vinciotti et al. (2005). Yet, when two interaction terms are added to the assumed true model of the simulated microarray data, the ESECE estimate of the loop design increases while there is no observed change in the reference design performance.

Design      ESECE (B_2 = 0) or A-efficiency    ESECE (B_2 = I)
Loop        0.60 (Fig. 2 a)                    0.86 (Fig. 2 b)
Reference   0.75 (Fig. 2 c)                    0.75 (Fig. 2 d)

Table 2 Comparison of Loop versus Reference Design based on the min variance and ESECE Criteria.

The results in Table 2 are based on the assumption $B_2 = 1.0\,I$. Consider all assumptions of the form $B_2 = \gamma I$, where $\gamma$ is the bias coefficient. When the bias coefficient is set to greater values, the loop design tends to have much higher ESECE values, while the reference design stays robust. Figure 11 shows the comparison of the two designs under increasing levels of bias, i.e., when the physical situation is characterized by larger interactive effects. Of course, when the bias coefficient is zero, the ESECE is A-optimality, so comparisons in other papers are generally based on the $\gamma = 0$ axis.

Figure 11 Comparison of the loop and reference design in simulation studies 1-4 with various levels of model mis-specification or bias. (x-axis: bias coefficient; y-axis: ESECE (var+bias); series: loop, reference)
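Under $B_2 = \gamma I$, equation (9) is linear in the bias coefficient, with slope $\mathrm{Tr}[MM']$ for $M = (X_1'X_1)^{-1}X_1'X_2$. A design for which $X_1'X_2 = 0$ therefore has a flat line. The sketch below assumes NumPy and illustrates this with one design matrix and two hypothetical omitted columns. Neither column is taken from Figures 8 to 10, and the function name `esece_gamma` is illustrative.

```python
import numpy as np


def esece_gamma(x1, x2, gamma, sigma2=1.0):
    """Equation (9) with B2 = gamma * I."""
    xtx_inv = np.linalg.inv(x1.T @ x1)
    alias = xtx_inv @ x1.T @ x2
    return gamma * np.trace(alias @ alias.T) + sigma2 * np.trace(xtx_inv)


x1 = np.array([[1, 0], [-1, 1], [0, -1]], dtype=float)  # assumed loop design
x2_aliased = np.array([[1], [0], [0]], dtype=float)     # hypothetical, X1'X2 != 0
x2_orthogonal = np.ones((3, 1))                         # hypothetical, X1'X2 = 0

gammas = [0.0, 0.5, 1.0, 2.0]
sensitive = [esece_gamma(x1, x2_aliased, g) for g in gammas]
robust = [esece_gamma(x1, x2_orthogonal, g) for g in gammas]
for g, s, r in zip(gammas, sensitive, robust):
    print(g, round(s, 4), round(r, 4))
```

At $\gamma = 0$ both series equal the variance term, and only the series with a nonzero $X_1'X_2$ grows as the bias coefficient increases.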

When the number of arrays is increased from 4 to 8, as in the designs of Figure 9, and the same type of 4 experiments is performed, the evaluation results in Table 3 are obtained under the same assumptions as before, i.e., $B_2 = \gamma I$.

Design      ESECE (B_2 = 0) or A-efficiency    ESECE (B_2 = I)
Loop        (Fig. 2 a)                         (Fig. 2 b)
Reference   0.25 (Fig. 2 c)                    0.25 (Fig. 2 d)

Table 3 Comparison of Loop versus Reference Designs in Figure 9 based on the min variance and ESECE Criteria.

Based on the results above, it is observed that, with no bias included, the loop design has a lower ESECE than the reference design, as expected. However, when two interaction terms are added to the assumed true model of the simulated microarray data, the ESECE estimate of the loop design becomes higher than that of the reference design, similar to the observations from the previous set of experiments.

The comparison results in Table 3 are given for the bias coefficient set to 1. We also provide the results for different levels of the bias coefficient in Figure 12. The results are again similar for these experiments: when the bias coefficient is set to greater values, the loop design tends to have much higher ESECE values, while the reference design stays robust.

Figure 12 Comparison of the loop and reference designs in simulation studies 5-8 with various levels of model mis-specification or bias.

We have conducted four more simulation studies for the reference and loop designs illustrated in Figure 10, where both the number of samples and the number of arrays are increased at the same time. The evaluation results for different levels of the bias coefficient are given in Figure 13.

Figure 13 Comparison of the loop and reference designs in simulation studies 9-12 with various levels of model mis-specification or bias. (x-axis: bias coefficient; y-axis: ESECE (var+bias); series: reference, loop)

Based on the results of the 12 simulation studies, it is noted that when the criterion used to evaluate the microarray designs is changed to reflect different assumptions, the relative performance order of the popular designs may change. More specifically, when the ESECE or Generalized A-Optimality criterion is used instead of the A-Optimality criterion, the reference designs tend to perform better than the associated loop designs with the same number of samples and the same number of arrays. At least this is shown for the particular designs stated in this chapter.

2.7 Plant Leaf Case Study

In this study, the relationship of the proposed criterion with real data from the literature is considered. The case study is "Cauliflower mosaic virus (CaMV) gene VI-induced host responses in transgenic Arabidopsis" in the Stanford Microarray Database (genome-www5). Let A = wild type col-0 leaves, B = w260 transgenic plant leaves, and C = d4 transgenic plant leaves. In this case, sufficient data is available that it is possible to consider what would happen if only a subset of the data were collected using a loop or reference design.

In total, there were eight slides of normalized microarray experiment data, four of which contain the log ratio (r/g) from the hybridization of samples A and B, and the other four from the hybridization of samples B and C (see Figure 14). All eight slides are formed as repeated dye-swaps between the mentioned pairs. Each slide has the same print and gene allocation of 15,488 genes (see Appendix A for a sample list of log-ratio data).

Figure 14 The plant leaf gene expression data with the associated sample comparisons (A with B, and B with C)

Experimental Assumptions

Using subsets, thirteen experimental designs were derived; they are shown in Figure 15. Five of these experimental designs belong to the family of reference designs and eight of them belong to the family of loop designs; more precisely, the latter are incomplete loop designs with one-way directed label assignments (for various examples of designs, see Yang and Speed, 2002).

Figure 15 (a) The reference designs and (b) the (incomplete) loop designs illustrated with the corresponding design matrices. Each design is a sub-design formed by comparing samples A, B, and C using 2, 3, 4, or 6 arrays (slides).

Given the experimental designs, the corresponding design matrices are built (see Figure 15), and the model form in (7) is assumed to be fitted to each data set in the particular experimental design, for a total of forty-four data sets of 15,488 genes.

2.7.2 Experimental Method

To find the empirical error estimates for these designs, a methodology is devised for finding a reasonable assumed true model. The model is fit to the whole data of the eight-slide experimental design, including the interaction effects between samples A and B and between samples B and C, since no comparison is made between A and C. The parameter estimates for each of the thirteen designs, each of which has fewer arrays, are compared with the parameter estimates of the assumed true model fitted to the whole data. The difference becomes an error estimate which includes both the variance of the parameters and the bias in the assumed true model. Then, the average estimated squared errors (AESE) can be calculated by averaging the squared errors of the fitted coefficients as follows:

$$\mathrm{AESE} = \underset{\text{genes}}{\mathrm{average}}\left\{\sum_{\text{contrasts}} (\beta_{kk'} - \beta_{kk',\text{est}})^2\right\} \tag{10}$$

where $\beta_{kk'} = \theta_{irk'} - \theta_{igk}$ (see section 2.5.1).

To illustrate the empirical ESECE calculation, we picked the 6508th gene, which achieved the median AESE value of all the 15,488 genes. This 6508th gene is a putative RAS super-family GTP-binding protein. We used the coefficient values fitted to the full DOE in Table 4 to approximate the true expression values. Table 4 also shows the runs included in the sub-DOE number 2. Table 5 contains the assumed true coefficient estimates and the fitted coefficient estimates for this gene based on the comparison of the designs shown in Table 4.
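The AESE in equation (10) is a mean over genes of per-gene sums of squared contrast errors. A minimal sketch, assuming NumPy and arrays of shape (genes, contrasts), is given below. The function name `aese` and the numbers are made up for illustration and are not values from the case study.

```python
import numpy as np


def aese(beta_true, beta_est):
    """Equation (10): mean over genes of the summed squared contrast errors.

    Both inputs have shape (genes, contrasts).
    """
    beta_true = np.asarray(beta_true, dtype=float)
    beta_est = np.asarray(beta_est, dtype=float)
    per_gene = np.sum((beta_true - beta_est) ** 2, axis=1)
    return per_gene.mean()


# Made-up coefficients for two genes and two contrasts.
full_fit = [[1.0, 0.0], [0.5, -0.5]]   # stands in for the assumed true model
sub_fit = [[0.0, 0.0], [0.5, 0.5]]     # stands in for a sub-design fit
print(aese(full_fit, sub_fit))
```

In the setting of this section, `beta_true` would hold the coefficients fitted to the full eight-slide data and `beta_est` the coefficients fitted to one sub-design.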

Table 4 Comparison of the full design with eight slides and the sub-design with four slides.

Slide      Included in Full DOE    Included in Sub-DOE 1
1 (A B)    Yes
2 (B A)    Yes                     Yes
3 (A B)    Yes
4 (B A)    Yes                     Yes
5 (C B)    Yes
6 (B C)    Yes                     Yes
7 (C B)    Yes
8 (B C)    Yes                     Yes

Table 5 True and fitted coefficient estimates for gene 6508 based on the comparison in Table 4. (Columns: assumed true coefficients, fitted coefficient estimates, squared error; rows: $\beta_{ab}$, $\beta_{ac}$, $\beta_{bc}$.)

Evaluation Results and Discussion

Empirical average estimated squared error (AESE) values were calculated for all 15,488 genes and the 13 experimental designs from Figure 15. Figure 16 plots those averages versus the theoretical ESECE estimates, which could be obtained before any experimentation using the formula in Equation (9). Based on the results of the experiments, it is observed that (1) the errors predicted by the ESECE formula in advance of any experimentation are roughly proportional to the empirical errors, and (2) the reference designs generally achieved lower empirical errors for fixed ESECE.

Figure 16 The comparison of the empirical squared error estimates with the predicted ESECE estimates for the sub-designs in Figure 15. (x-axis: predicted squared errors; y-axis: average empirical squared errors (AESE); series: loop, reference)

As a result, for equal numbers of slides, the parameters of the reference designs seem to be more effective in representing the assumed true model than those of the loop designs, as predicted by the ESECE criterion.

The reference and loop sub-designs are also evaluated on an information loss and cost trade-off function as follows:

Information loss ($i$) = ESECE ($i$ arrays) - ESECE (6 arrays), for $i$ = 2, 3, 4

Cost = number of arrays × unit cost (array cost + biological sample material cost)

where the estimated unit cost is typically approximately $1K.
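The trade-off function above amounts to two small calculations. The sketch below uses placeholder ESECE values keyed by the number of arrays. These numbers and the function names are assumptions for illustration and are not results from the study. The unit cost defaults to the approximate $1K figure mentioned above.

```python
def information_loss(esece_by_arrays, baseline_arrays=6):
    """ESECE(i arrays) - ESECE(baseline arrays) for each smaller design."""
    baseline = esece_by_arrays[baseline_arrays]
    return {i: value - baseline
            for i, value in esece_by_arrays.items() if i != baseline_arrays}


def experiment_cost(n_arrays, unit_cost=1000.0):
    """Number of arrays times the unit cost (array plus sample material)."""
    return n_arrays * unit_cost


# Placeholder ESECE values keyed by number of arrays (not study results).
esece_values = {2: 2.0, 3: 1.5, 4: 1.0, 6: 0.5}
loss = information_loss(esece_values)
for arrays in sorted(loss):
    print(arrays, experiment_cost(arrays), loss[arrays])
```

Plotting the loss against the cost for each design family gives a curve of the kind described for Figure 17.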

Figure 17 shows the information loss and cost trade-off function for the sub-designs with 2, 3, and 4 arrays compared with the 6-array design. Based on the figure, it is observed that, for the same cost, the reference designs yield lower information loss than the corresponding loop designs, relative to the information gained using a 6-array design.

Figure 17 Information loss versus cost trade-off function (x-axis: experimental cost (array + biological unit); y-axis: information loss; series: ref, loop)

2.8 Contribution of the Research to Biological Studies

We have proposed the ESECE or Generalized A-Optimality criterion as an alternative to the most widely used A-optimality criterion. A-Optimality is the average variance of the parameters of the model under optimistic assumptions. Intuitively, if the variance of the parameters is larger, or the accuracy of the beta estimates is low, then the percentage of the variance of the signal intensities explained by the model is lower than the variance explained by the random errors. Therefore, the accuracy of the results, or in other terms the false hit or miss rate, depends on the generalized variance of the parameters. This relation is discussed in more detail in Vinciotti et al. (2005), and a related study is illustrated in Fig. 4 of that paper. As ESECE is a generalized form of A-Optimality that addresses both the variance and the bias of the beta parameters, it seems likely that similar results apply for this criterion as well. In other words, it is surmised that the designs performing better under the ESECE criterion produce gene expression data with a higher true hit rate and a lower false hit rate.

Experiments and Results

Based on the discussion above, we have done experiments with the real microarray data from the plant leaf case study discussed in section 2.7. We have used two designs from Figure 15, one from the family of reference designs and one from the family of loop designs, illustrated in Figure 18.

Figure 18 (a) The 4th reference and (b) the 4th loop design from Figure 15.

This experimental study compares the differentially expressed genes identified from the gene expression log-ratio data using the sub-designs illustrated in the above figure with the (assumed) truly differentially expressed genes identified by the full DOE (the 8-array microarray design) shown in Figure 14. We have chosen the sub-designs that are conducted using 3 arrays for comparing the associated samples. The empirical ESECE estimates of the reference sub-design and of the loop sub-design are shown in Figure 16.

We have identified the differentially expressed genes among the 15,488 genes for the two sub-designs and the full design. Then all the genes that are found to be differentially expressed are compared. The true hits and the false hits are then calculated for each sub-design. These steps are all repeated for 6 different threshold levels of differential expression. The results of all the experiments, together with the corresponding threshold levels, are shown in Table 6.

Table 6 True hit and false hit numbers and ratios for the sub-reference and sub-loop designs. (Columns: threshold; # of true hits (ref, loop); # of false hits (ref, loop); true hit ratio (ref, loop); false hit ratio (ref, loop).)

Based on the experiment results in the table above, we have built the ROC (Receiver Operating Characteristic) curves for the two designs, shown in Figure 19, where the x-axis corresponds to the false hit rate or 1-Specificity and the y-axis corresponds to the true hit rate or Sensitivity of the reference and loop designs, based on different thresholds of differential expression of the genes used in the study. The results show that the ROC curve corresponding to the reference design is higher and closer to the y-axis border than the curve corresponding to the loop design.
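The hit counts behind such an ROC comparison can be sketched as follows. The calling rule used here (a gene is declared differentially expressed when the absolute value of its estimate exceeds the threshold) and the ratio definitions (true hits over truly expressed genes, false hits over the remaining genes) are assumptions consistent with the sensitivity and 1-specificity axes, not a description of the exact procedure used in the study. The estimates are made-up numbers.

```python
def hit_rates(sub_estimates, full_estimates, threshold):
    """True and false hits of a sub-design against the full design.

    A gene is called differentially expressed when the absolute value of
    its estimate exceeds the threshold (an assumed calling rule).
    """
    true_hits = false_hits = positives = negatives = 0
    for sub, full in zip(sub_estimates, full_estimates):
        truly_de = abs(full) > threshold
        called_de = abs(sub) > threshold
        positives += truly_de
        negatives += not truly_de
        true_hits += called_de and truly_de
        false_hits += called_de and not truly_de
    true_ratio = true_hits / positives if positives else 0.0
    false_ratio = false_hits / negatives if negatives else 0.0
    return true_hits, false_hits, true_ratio, false_ratio


# Made-up estimates for five genes.
full = [2.0, -1.5, 0.2, 0.1, 1.2]   # full-design estimates, taken as truth
sub = [1.8, -0.4, 1.1, 0.0, 1.3]    # estimates from a hypothetical sub-design
print(hit_rates(sub, full, threshold=1.0))
```

Repeating the call over several thresholds yields one (false hit ratio, true hit ratio) point per threshold, which is how an ROC curve is traced.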

This shows that the differentially expressed genes identified using the reference design are more accurate than those identified using the loop design. These results are consistent with the empirical ESECE estimates discussed earlier in this chapter.

Figure 19 ROC curves based on the differentially expressed genes in the study (x-axis: False Positive Rate (1-Specificity); y-axis: True Positive Rate (Sensitivity); curves: reference, loop).

2.9 Research Summary

In this chapter, a criterion is proposed to predict, before experimentation, the errors that the user of an experimental design will likely incur. This criterion also includes the effects of model misspecification, or bias. Based on all the simulation and real data experiments, if bias is incorporated into the calculation of the efficiencies of

the experimental designs, reference designs seem likely to foster lower estimation errors in the differential gene expression estimates than loop designs. In particular, the following results were obtained from the experiments:

- The ESECE criterion is a generalized form of the A-optimality criterion, since it gives the same results as A-optimality when the bias coefficient is set to 0.
- When the criterion used to compare two designs is changed, the ranking of the designs might be reversed.
- When bias is incorporated into the efficiency model, the loop design seems likely to foster higher prediction errors than reference designs.
- The ESECE criterion was validated in that the designs predicted to foster lower (or higher) prediction errors did foster lower (or higher) prediction errors in our plant leaf case study and in the empirical estimates of these errors.

In contrast to some practices in the literature, it is believed that reference designs and loop designs should be compared using both an equal number of samples and an equal number of slides. Otherwise, the efficiency comparisons between the experimental designs are not fair. Based on this argument, in both the simulation and the real data evaluations it was assumed that one of the samples is the reference sample.

One limitation of this empirical comparison is that it was performed on incomplete loop designs. This is a concern since the bias effects in general might be higher if the interaction between the samples of the missing arrays could be observed. Such a result would enlarge the gap between the efficiency of the reference and the loop designs. Despite the missing arrays, eight sets of loop designs were observed to be less desirable than the corresponding reference designs. The ESECE was able to capture the efficiency differences and the drawbacks of the loop designs. This shows the value of the proposed ESECE criterion.

2.10 Conclusion and Future Work

This chapter discusses the Expected Squared Error of Coefficient Estimates (ESECE) criterion, which is designed to evaluate the vulnerabilities of any given cDNA microarray experimental design based on the errors it is likely to foster. Using the proposed criterion, the merits of traditional reference designs relative to loop designs are exposed. These studies also show how the proposed ESECE criterion can be used to generate novel hybrid plans, i.e., designs that are neither loop nor reference designs. Further extensions of the current research can consider additional factors such as perturbations to the biological environments, the location of specific genes on arrays, and the use of additional dyes. Lastly, meta-analysis of new data combined with data from existing databases of results can improve estimates of dye, gene, and sample effects.

2.11 References

Allen, T. T., Yu, L. and Bernshteyn, M. (2000) Low Cost Response Surface Methods Applied to the Design of Plastic Snap Fits. Quality Engineering, 12.
Allen, T. T., Yu, L. and Schmitz, J. (2003) An Experimental Design Criterion for Minimizing Meta-Model Prediction Errors Applied to Die Casting Machine Design. Journal of the Royal Statistical Society Series C: Applied Statistics, 52(1).
Blalock, E. M. (2003) A Beginner's Guide to Microarrays, Kluwer Academic Publishers.
Brown, P. O. and Botstein, D. (1999) Exploring the New World of the Genome with DNA Microarrays. Nature Genetics, 21(1 Suppl).
Callow, M. J., Dudoit, S., Gong, E. L., Speed, T. P. and Rubin, E. M. (2000) Microarray expression profiling identifies genes with altered expression in HDL-deficient mice. Genome Res., 10.
Campbell, M. J. and Machin, D. (2002) Medical Statistics, John Wiley & Sons, Ltd.
Causton, H. C., Quackenbush, J. and Brazma, A. (2003) Microarray Gene Expression Data Analysis, Blackwell Publishing, UK.
Churchill, G. A. (2002) Fundamentals of experimental design for cDNA microarrays. Nature Genetics, 32.
Glonek, G. F. V. and Solomon, P. J. (2004) Factorial and time course designs for cDNA microarray experiments. Biostatistics, 5.
Kerr, M. K. and Churchill, G. A. (2001) Experimental design for gene expression microarrays. Biostatistics, 2.
Kerr, M. K. and Churchill, G. A. (2001) Statistical design and the analysis of gene expression microarray data. Genet. Res., 77.
Khanin, R. and Wit, E. C. (2005) Near-optimal designs for dual channel microarray experiments. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54(5).

Landgrebe, J., Bretz, F. and Brunner, E. (2006) Efficient design and analysis of two colour factorial microarray experiments. Computational Statistics and Data Analysis, 50(2).
Rosa, G. J. M., Leon, N. D. and Rosa, A. J. M. (2006) Review of microarray experimental design strategies for genetical genomics studies. Physiological Genomics, 28.
Van 't Veer, L. J. et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415.
Vinciotti, V., Khanin, R., D'Alimonte, D., Liu, X., Cattini, N., Hotchkiss, G., Bucca, G., de Jesus, O., Rasaiyaah, J., Smith, C., Kellam, P. and Wit, E. (2005) An experimental evaluation of a loop versus a reference design for two-channel microarrays. Bioinformatics, 21(4).
Yang, Y. H. and Speed, T. (2002) Design issues for cDNA microarray experiments. Nat. Rev. Genet., 3.
Yang, X. (2003) Optimal Design of Single Factor cDNA Microarray Experiments and Mixed Models for Gene Expression Data. Dissertation, Faculty of the Virginia Polytechnic Institute and State University.
Internet Source:
Internet Source: ExpDesign_Intro.htm

CHAPTER 3

OPTIMAL EXPERIMENTAL DESIGN FOR CO-HYBRIDIZED MICROARRAYS

3.1 Introduction

In the previous chapter, the expected squared errors of coefficient estimates (ESECE), or generalized A-optimality, criterion was proposed. It was demonstrated that this criterion can be used to predict the estimation errors prior to experimentation and to evaluate whether additional samples will likely be desirable. It was also observed that ESECE depends on relatively reasonable assumptions that include bias effects of the fitted model caused by interactions between the sample and the dye. In this chapter, methods for generating ESECE optimal designs are proposed. These designs are referred to as hybrid designs because they can be viewed as a compromise between

reference and interwoven loop designs. By improving the ESECE criterion, the generated hybrid designs seem likely to foster greater design accuracies than alternative designs based on the same numbers of samples and slides.

Kerr et al. (2001) argue that it is reasonable to mimic the patterns of optimal designs generated for relatively small numbers of samples and slides (e.g., 5 samples, 10 slides) when manually generating larger designs. In their paper, they provide a list of designs for n samples and 2n arrays that belong to the family of interwoven loop designs. These designs putatively maximize A-efficiency for the smaller cases and appear to be competitive for the larger cases. Similarly, Wit et al. (2005), in their near-optimal design paper, compare a simulated annealing heuristic with the interwoven loop designs. They state that the interwoven loop designs generally result in higher objective values than the designs produced using simulated annealing because of the difficulty of the associated integer programs. Wit et al. (2005) focused on so-called L-efficiency rather than A-optimality, but both are variance-based criteria, unlike the ESECE criterion, which includes bias. We suggest that other heuristic approaches, such as genetic algorithms, could also be applied to the microarray domain. For example, Hadj-Alouane and Bean (1997) propose a random-keys approach for genetic algorithms which could be explored in future research.

In this chapter, two array generation methods are also proposed: an algorithm based on enumeration that is guaranteed to generate globally ESECE optimal designs, and a heuristic motivated by the Kerr et al. (2001) approach. The proposed heuristic is

apparently useful for relatively large design generation problems. Section 3.2 reviews optimal experimental design criteria and the relationship of the ESECE to previous criteria discussed in the literature. Section 3.3 describes constraints on the set of possibly acceptable experimental designs and exploits these constraints to derive an upper bound on the computation times for efficient enumeration methods. Section 3.4 describes code for an admittedly inefficient method for enumerating feasible designs which is still viable for many cases of potential interest to biologists. In Section 3.5, a heuristic method motivated by the patterns of optimal designs for relatively small numbers of samples and factors is described. This method explores designs associated with zero bias and explicitly minimizes the prediction variance. Section 3.6 compares the resulting class of putatively optimal hybrid designs with reference and interwoven loop designs. Section 3.7 summarizes the findings and describes opportunities for future research.

3.2 Evaluation Criteria of Microarray Experiments

Commonly discussed criteria in the literature include A-optimality, D-optimality, and G-optimality, which all relate to $(X_1'X_1)^{-1}$, the matrix that describes the variances of the estimated coefficients under the assumption of zero bias. Specifically:

(i) A-optimality minimizes

$\mathrm{Tr}[(X_1'X_1)^{-1}] / (n-1)$ (11)

which is proportional to the average variance of the estimated coefficients.

(ii) D-optimality targets minimizing $\det[(X_1'X_1)^{-1}]$, which results in minimizing the volume of the joint confidence ellipsoid of the estimated coefficients $\hat{\beta}$, where $\sigma^2 (X_1'X_1)^{-1}$ gives the covariance matrix of $\hat{\beta}$ (assuming the errors are normally distributed).

(iii) G-optimality minimizes $\max\{v_i\}$, the maximum standardized prediction variance, where $v_i = x_i'(X_1'X_1)^{-1}x_i$ is the prediction variance, standardized by $\sigma^2$, obtained at the prediction point $x_i$.

Yang and Speed (2002), Kerr and Churchill (2001), and Vinciotti et al. (2005) all investigated these variance-based criteria (A-, D-, and G-optimality), which are associated with the assumption that the fitted model forms are in fact the true model forms.

Next, we review the Expected Squared Errors of Coefficient Estimates (ESECE) criterion from Chapter 2. Assume that the log-ratio signal intensities from experiments, $y$, derive from the following model:

$y = X_1\beta_1 + X_2\beta_2 + \varepsilon$ (12)

where $X_1$ is the $n \times (T-1)$ design matrix and $n$ is the number of slides, or experimental runs, of the experimental plan $D$, which is an $n \times (T-1)$ array. $\beta_1$ is a vector containing the true values of the fitted coefficients. $X_2$ is an $n \times (T-1)$ design matrix composed of the interaction contrasts, and $\beta_2$ is the corresponding vector of additional terms in the hypothetical true model, where $\varepsilon$ is an $n$ vector of experimental

random errors with standard deviation $\sigma$. Yet, because of limitations on the number of slides $n$, the model fitted to estimate the differential expression parameters of a given gene is still:

$\hat{y} = X_1\hat{\beta}_1$ (13)

where $\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'y$.

Allen, Yu, and Schmitz (2003) proposed the Expected Integrated Mean Square Error (EIMSE) criterion, which minimizes the expected mean square error and focuses on the prediction of responses at novel conditions. Allen, Bernshteyn, Yu, and Kabiri (2003) show that EIMSE optimal designs produce relatively low prediction errors in the context of computer experiment case studies from the literature. The EIMSE has already been used to achieve useful engineering results, as described in Allen, Yu, and Schmitz (2003) and Allen, Yu, and Bernshteyn (2000). Yet, in modeling based on microarrays, prediction is of interest only for discrete gene types, and accurate estimation of mean expression differences, i.e., the $\beta_1$ model parameters, is of primary interest.

To achieve the best results from a microarray experiment, the $\beta_1$ estimates, or the contrast estimates, should be accurate, i.e., the total errors (variance plus bias) of the $\beta_1$ estimates should be minimized. Accurate estimation of the differences in $\beta_1$ can reasonably be expected to reduce the chance of so-called false positives and negatives. These occur when errors in estimating the sample effects lead to incorrect conclusions about whether genes are differentially expressed between different samples. The

proposed Expected Squared Error of Coefficient Estimates (ESECE) criterion is the sum of the expected squared errors associated with estimation for a given design, $D$. The proposed criterion is:

$\mathrm{ESECE}(D) \equiv E_{\beta_2,\varepsilon}[(\beta_1 - \hat{\beta}_1)'(\beta_1 - \hat{\beta}_1)] / (n-1)$ (14)

where $\hat{\beta}_1 = (X_1'X_1)^{-1}X_1'y$, $y = X_1\beta_1 + X_2\beta_2 + \varepsilon$, $\beta_2 \sim MN[\mu_2, B_2]$, and $\varepsilon \sim N[0, I\sigma^2]$, and where the division by $(n-1)$ has been added to the definition to normalize the values in a manner similar to A-optimality. We refer to ESECE $/ (n-1)$ as the average ESECE. $B_2$ is the prior covariance matrix of the missing coefficients (e.g., see Allen, Yu, and Schmitz, 2003) and MN refers to the multivariate normal distribution. Note that assumptions about $\beta_1$ are not needed because of a cancellation in the top two lines of the formula in equation (14), independent of assumptions about $\beta_2$. The following formula permits computationally efficient evaluation under the often relevant assumption that $\mu_2 = 0$:

$\mathrm{ESECE}(D) = \{\mathrm{Tr}[(X_1'X_1)^{-1}(X_1'X_2)B_2(X_1'X_2)'(X_1'X_1)^{-1}] + \sigma^2\,\mathrm{Tr}[(X_1'X_1)^{-1}]\} / (n-1)$. (15)

The assumption $\mu_2 = 0$ is generally reasonable since these are coefficients that one is not fitting and hopes are near zero in magnitude and are neither more likely to

be positive nor negative. The formula based on $\mu_2 \neq 0$ is more complicated and a topic for future study. Also, it might be of interest to quote ESECE(D) divided by the number of total model terms to give the expected average squared error.

Also, the ESECE can be called generalized A-optimality or generalized A-efficiency because the second part of equation (15), $\sigma^2\,\mathrm{Tr}[(X_1'X_1)^{-1}]$, is A-optimality. The first part represents the bias arising from the difference between the fitted and the true model. If one makes the optimistic assumption $B_2 = 0$, the ESECE reduces to A-optimality.

3.3 Structure of Feasible Microarray Designs

Details of the graphical representation of microarray designs were discussed in Chapter 2. To review, an arrow in the microarray graph represents one array experiment, and the direction of the arrow corresponds to the dye labeling. The nodes joined by an arrow are the samples compared on the array. As Figure 4 in Chapter 2 shows, a 1 in the design matrix corresponds to the green label and a -1 to the red label. The samples that have a 0 are the ones not included in the particular experiment. Using this notation, we can itemize the effective feasibility constraints of a microarray design as follows:

1. Each row in the design matrix should have exactly one 1 and one -1, and the rest of its entries are 0s, because one can compare only two samples on a single array. This constraint makes the graph directed.

2. To ensure estimability of all of the key contrasts, the graph of the microarray experiments must be connected. This means that each sample should be compared to at least one other sample, and there must exist a direct or indirect path between every pair of samples. For example, Figure 20 below illustrates an infeasible microarray graph, where every sample is compared to at least one sample but there does not exist a path between samples A and B or A and D, or between C and B or C and D. In the example in Figure 20, the experimenter is not able to compare sample A and B or sample C and D. Satisfying the connectivity constraint ensures that all sample comparisons are theoretically possible.

Figure 20 An infeasible microarray design violating the connectivity constraint (nodes: A, B, C, D)

3. The rows of the design matrices of the microarray graphs are interchangeable. This means the order of the experiments does not matter from a statistical point of view, so designs that differ only in row order correspond to isomorphic graphs. See Figure 21 for the design matrices of two identical microarray graphs.

Figure 21 Illustration of the design matrices of isomorphic graphs

Therefore, feasible microarray designs graphically produce directed and connected graphs. These constraints cause many infeasible and redundant designs to arise during enumeration of all possible designs for a given number of samples and arrays. If the feasibility constraints are not satisfied for a microarray design, the information matrix becomes singular and there is no unique solution for the variance calculation.

Number of Candidate Microarray Designs

Considering the effective feasibility constraints of microarray designs, an upper bound for the total number of feasible microarray designs is derived as follows. Define the number of samples used for the study as n and the number of slides used for comparing these samples as m. For the connectivity constraint, the

number of arrays should be at least n-1. The number of allocations to (n-1) rows for a minimal connected graph has an upper bound of $2^{n-1}\,C(C(n,2),\,n-1)$, where $C(a,b)$ refers to the combination function (the number of ways to choose b items from a). This follows since a graph must include at least (n-1) rows with the minimal comparisons of every sample in order to be a connected graph. The remaining m-(n-1) rows can be allocated by any of the permutations $P(n,2)$, but since replications of the rows are also accepted, the total number becomes $C(P(n,2)+m-n,\,m-n+1)$, which together generates the resulting formula

$2^{n-1}\,C(C(n,2),\,n-1)\;C(P(n,2)+m-n,\,m-n+1)$ (16)

for n samples and m slides. The bound in equation (16) is not tight because, in some cases, two designs being enumerated could be equivalent except for the ordering of the runs.

In a related enumeration study, Hou and Torney (2000) found a general formula for the number of labeled connected graphs with n nodes (samples) and m edges (slides) as the following:

$\sum_{k=1}^{n} \frac{(-1)^{k+1}}{k} \sum_{n_1+\dots+n_k=n,\; n_i>0} \frac{n!}{\prod_{i=1}^{k} n_i!}\, C(s, m)$, where $s = \sum_{i=1}^{k} C(n_i, 2)$.

Equation (16) is more relevant, however, since the microarray case involves directed and connected graphs.

Enumerating All Designs

In this section, the amount of time required to enumerate effectively feasible designs is considered. First, the number of floating-point operations for each evaluation of the expected squared errors of coefficient estimates (ESECE) criterion is approximated. Next, this information is combined with the bound in equation (16) to yield time estimates and to show for which cases enumeration is effectively feasible.

A main concern in enumeration is the number of floating-point operations required to identify the optimal microarray design. Since all feasible microarray designs consist only of the values 1, -1, or 0, the enumeration of the designs does not involve any floating-point operation. Therefore, the only part with floating-point operations is the evaluation of the designs. We first review the evaluation formula (15) derived for a microarray design, D:

$\mathrm{ESECE}(D) = \mathrm{Tr}[(X_1'X_1)^{-1}(X_1'X_2)B_2(X_1'X_2)'(X_1'X_1)^{-1}] + \sigma^2\,\mathrm{Tr}[(X_1'X_1)^{-1}]$

where $B_2$ and $\sigma^2$ are the only floating-point numbers.
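The evaluation formula above can be illustrated with a minimal Python sketch. It is not the Matlab program of Section 3.4: the helper names (`transpose`, `matmul`, `trace`, `inverse`, `esece`) are hypothetical, exact rational arithmetic from the standard `fractions` module is used instead of floating point, and the small X1 and X2 matrices are assumed toy inputs (a 3-slide ring on 3 samples with the first sample's column removed).

```python
from fractions import Fraction


def transpose(M):
    return [list(col) for col in zip(*M)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def trace(M):
    return sum(M[i][i] for i in range(len(M)))


def inverse(M):
    """Gauss-Jordan inverse with exact fractions; raises if M is singular."""
    n = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("singular information matrix (infeasible design)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def esece(X1, X2, B2, sigma2=1):
    """Tr[(X1'X1)^-1 (X1'X2) B2 (X1'X2)' (X1'X1)^-1] + sigma2 * Tr[(X1'X1)^-1]."""
    X1t = transpose(X1)
    info_inv = inverse(matmul(X1t, X1))
    cross = matmul(X1t, X2)
    bias = trace(matmul(matmul(matmul(matmul(info_inv, cross), B2),
                               transpose(cross)), info_inv))
    variance = sigma2 * trace(info_inv)
    return bias + variance


# Assumed toy inputs: ring A->B, B->C, C->A with the column of sample A removed.
X1 = [[-1, 0], [1, -1], [0, 1]]
X2 = [[1, 0], [0, 0], [0, 1]]
identity = [[1, 0], [0, 1]]
zero = [[0, 0], [0, 0]]
print(esece(X1, X2, identity))  # bias + variance
print(esece(X1, X2, zero))      # B2 = 0: only the A-optimality (variance) term remains
```

Setting B2 to the zero matrix removes the first trace term, which is the sense in which the criterion reduces to A-optimality.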

For n samples and m slides, $X_1$ is an m by (n-1) matrix, and $X_2$ (the interaction-effects matrix) is an m by [n(n-1)/2] matrix if all the pairwise interactions are included in the true model. Depending on the compared samples and the assumptions about the interaction terms, matrix sizes may vary. Here the focus is (somewhat arbitrarily) placed on the number of operations for a single ESECE calculation of a microarray design involving [n(n-1)/2] interaction terms.

When multiplying an (n-1) by m matrix with an m by [n(n-1)/2] matrix, multiplying one row with a single column includes [n(n-1)/2] multiplications and [n(n-1)/2]-1 summations, which forms one element of the product matrix. Since there are (n-1)[n(n-1)/2] elements, the total number of operations becomes (n-1)[n(n-1)/2][n(n-1)-1] for the multiplication of these matrices.

Returning to the evaluation criterion itself, the following number of floating-point operations results, assuming that $B_2$ has distinct elements and all the interaction effects are included in $X_2$. For n samples and m slides, there are

$\frac{n(n-1)^2}{2}\,[n(n-1)-1] + (n-1)^2\,[n(n-1)-1] + n$

floating-point operations for a single ESECE calculation of a microarray design. Since the number of feasible microarray designs is approximately $2^{n-1}\,C(C(n,2),\,n-1)\;C(P(n,2)+m-n,\,m-n+1)$ (see equation 16), an approximate upper bound on the number of floating-point operations needed to evaluate all the feasible designs based on the ESECE criterion is:

$2^{n-1}\,C(C(n,2),\,n-1)\;C(P(n,2)+m-n,\,m-n+1)\,\left[\frac{n(n-1)^2}{2}\,[n(n-1)-1] + (n-1)^2\,[n(n-1)-1] + n\right]$.

After the evaluation step, a comparison should be made to identify the optimal design with the minimum ESECE. Since the assignment of a floating-point number is also a floating-point operation, the total number of comparisons, including the initial assignment of a temporary variable, requires the total number of feasible designs to be added to the above number. Finally,

$2^{n-1}\,C(C(n,2),\,n-1)\;C(P(n,2)+m-n,\,m-n+1)\,\left[\frac{n(n-1)^2}{2}\,[n(n-1)-1] + (n-1)^2\,[n(n-1)-1] + n\right] + 2^{n-1}\,C(C(n,2),\,n-1)\;C(P(n,2)+m-n,\,m-n+1)$

approximates the total number of floating-point operations executed to find the optimal design with n samples and m slides. This number is O( 2 n m n n ) for m >> n, which is exponential in type, so that for large m and n optimization using the associated exhaustive enumeration becomes computationally intractable.

Optimistically assume that a 3 GHz computer can compute a single floating-point operation in every cycle. Then, one can estimate the total execution time of our optimal experimental design solution for a couple of realistically sized problems. For example, for 3 samples and 3 slides, the total number of floating-point operations required is estimated to be:

$2^{n-1}\,C(C(n,2),\,n-1)\;C(P(n,2)+m-n,\,m-n+1)\,\left[\frac{n(n-1)^2}{2}\,[n(n-1)-1] + (n-1)^2\,[n(n-1)-1] + n\right] + 2^{n-1}\,C(C(n,2),\,n-1)\;C(P(n,2)+m-n,\,m-n+1)$
$= 2^{2}\,C(C(3,2),\,2)\;C(P(3,2),\,1)\,(30 + 20 + 3) + 2^{2}\,C(C(3,2),\,2)\;C(P(3,2),\,1) = 3{,}888$ or fewer.

Assume that execution of a single floating-point operation requires $1/(3\times10^{9})$ seconds. Then, the total execution time for finding the optimal design with 3 samples and 3 slides is approximately $1.296\times10^{-6}$ seconds. Table 7 shows time estimates derived using this approach, giving the estimated number of Floating Operations (FO) and the estimated Total Execution Time (TET).

Table 7 Number of floating operations and the corresponding execution time for n samples and m slides
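The arithmetic of this example can be checked with a short Python sketch. The function names are hypothetical, and the closed forms coded below are one reading of the counting bound and of the per-evaluation operation count that reproduces the figures quoted in the text (3,888 operations and about 1.296e-6 seconds for n = 3, m = 3); they should be treated as assumptions rather than as the definitive formulas.

```python
from math import comb, perm


def design_count_bound(n, m):
    """Assumed form of the bound on candidate designs for n samples, m slides."""
    spanning = 2 ** (n - 1) * comb(comb(n, 2), n - 1)   # n-1 connecting slides, 2 dye directions each
    remaining = comb(perm(n, 2) + m - n, m - n + 1)     # other m-n+1 slides, repeats allowed
    return spanning * remaining


def flops_per_esece(n):
    """Assumed operation count for one ESECE evaluation with n samples."""
    inner = n * (n - 1) - 1
    return n * (n - 1) ** 2 // 2 * inner + (n - 1) ** 2 * inner + n


def total_flops(n, m):
    # one evaluation per candidate, plus one comparison/assignment per candidate
    count = design_count_bound(n, m)
    return count * flops_per_esece(n) + count


def execution_time_seconds(n, m, clock_hz=3e9):
    # optimistic assumption from the text: one floating-point operation per cycle
    return total_flops(n, m) / clock_hz


print(design_count_bound(3, 3), flops_per_esece(3), total_flops(3, 3))
print(execution_time_seconds(3, 3))
```

Calling `total_flops` for larger n and m shows how quickly the bound grows, which is the motivation for the heuristic in Section 3.5.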

The approximate bounds in Table 7 indicate that for numbers of samples and slides such as 8 by 10, exhaustive enumeration is not feasible. Yet, for many possible cases of interest, enumeration is feasible.

3.4 Code to Optimize Using Exhaustive Enumeration

In this section, computer programs for enumeration of feasible designs to yield expected squared errors of coefficient estimates (ESECE) optimal designs are presented. These programs are written in Matlab, and the notation is altered to facilitate code writing and manipulation. The number of samples is nsample and the number of available arrays is narray. The functions in the program include: newdoe, doe, gendesign, sete, isdesignvalid, computeesece, and restartcreate. The initial function used in execution is newdoe, written as follows.

    function newdoe(nsample, narray)
        % Entry point: search all candidate designs for the ESECE optimal one.
        if narray < nsample-1
            % A connected design needs at least nsample-1 arrays.
            sprintf('with %d samples, you need at least %d arrays (you have only %d).\n', nsample, nsample-1, narray)
        else
            current = 0;
            currentmin = 0;
            currentmax = (nsample*(nsample-1))^narray - 1;  % index of the last candidate
            best = zeros(narray, nsample);
            bestval = inf;
            doe(nsample, narray, current, currentmin, currentmax, bestval, best);
        end
    end

Function newdoe initializes the main parameters by setting the initial design matrix to all zeros and the best value to infinity. Each design is represented as a matrix with narray rows and nsample columns. After initialization, function doe is called with these values.

The function doe enumerates all valid DOEs based on the nsample and narray values and identifies the ESECE optimal design. Specifically, doe iterates over all permutations of the rows of the matrix E (generated by function sete), which holds the list of designs for a single array. In each iteration (current:currentmax), narray selections are made from the matrix E. The vector A stores the identifiers ("ids") of the chosen rows of E. These ids are used by function gendesign to select the corresponding designs from E. Note that an enumeration of all possible choices generates (nsample*(nsample-1))^(narray) designs, including redundant combinations. The approach of cycling through all permutations of E is admittedly inefficient and is used for simplicity only. More efficient procedures based on the concepts associated with equation (16) are proposed for future study.

    function valid = doe(nsample, narray, current, currentmin, currentmax, bestval, best)
        E = sete(nsample);               % all single-array designs, one per row
        nsingle = nsample*(nsample-1);   % number of rows of E (was named max, which shadows a built-in)
        valid = 0;
        % amax(i) = nsingle^(i-1): place values used to decode index a into narray ids
        amax = zeros(1, narray+1);
        amax(1) = 1;
        for i = 2:narray+1
            amax(i) = amax(i-1) * nsingle;
        end
        for a = current:currentmax
            for i = 1:narray
                A(i) = 1 + mod(floor(a / amax(i)), nsingle);
            end
            % for debugging only
            % A
            % D = gendesign(nsample, narray, A, E)
            if (isdesignvalid(nsample, narray, A, E))
                valid = valid + 1;
                D = gendesign(nsample, narray, A, E);
                newesece = computeesece(D, nsample, narray);
                if (newesece < bestval)
                    best = D;
                    bestval = newesece;
                end
            end
            if (mod(a, 5000)==0)
                % checkpoint every 5,000 candidates
                restartcreate(nsample, narray, a, currentmin, currentmax, bestval, best);
            end
        end
        restartcreate(nsample, narray, a, currentmin, currentmax, bestval, best);
        sprintf('%d valid designs out of %d searched\n', valid, currentmax-currentmin+1)
        sprintf('bestesece = %f\n with design: ', bestval)
        best
    end
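The index decoding inside doe can be mirrored in a few lines of Python. This is an illustrative sketch only: `decode_design_ids` is a hypothetical name, and it assumes the same convention as the Matlab loop, i.e., the enumeration index a is read as a number in base nsample*(nsample-1) whose digits (shifted to be 1-based) select rows of E.

```python
def decode_design_ids(a, n_single, narray):
    """Return the 1-based row ids into E for enumeration index a.

    n_single: number of single-array designs, nsample*(nsample-1).
    narray:   number of arrays (digits to extract).
    """
    return [1 + (a // n_single ** i) % n_single for i in range(narray)]


nsample, narray = 3, 3
n_single = nsample * (nsample - 1)            # 6 possible single-array designs
print(decode_design_ids(0, n_single, narray))   # first candidate
print(decode_design_ids(7, n_single, narray))
print(decode_design_ids(n_single ** narray - 1, n_single, narray))  # last candidate
```

Because every index in 0..n_single^narray - 1 decodes to a distinct id vector, splitting the index range between machines (as restartcreate allows) partitions the candidates without overlap.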

For efficiency, the generated designs are not stored explicitly. Each design is checked for effective feasibility (see section 3.3) by the function isdesignvalid. If the design is effectively feasible, its ESECE value (computed by function computeesece) is compared with the current best value. After each run of 5,000 designs, the current results are written to a file to monitor the progress. In practice, the effective feasibility check generally creates a bottleneck such that the bound in equation (16) does not apply to this method. The minimum ESECE value (bestval) and the corresponding best design found so far are kept and returned at program termination.

The function gendesign generates a DOE by using the given narray ids (in the vector A) to choose the corresponding array designs from matrix E.

    function D = gendesign(nsample, narray, A, E)
        % Stack the single-array rows of E selected by the ids in A.
        for r = 1:narray
            D(r,:) = E(A(r),:);
        end
    end

The function sete generates a two-dimensional matrix E composed of a list of all possible designs for a single array. The matrix E is first set to all zeros, then each row is iteratively changed by updating two of the nsample positions to 1 and -1.

    function E = sete(nsample)
        % One row per ordered pair of distinct samples (i, j):
        % sample i gets 1 and sample j gets -1.
        E = zeros(nsample*(nsample-1), nsample);
        a = 1;
        for i = 1:nsample
            for j = 1:nsample
                if i ~= j
                    E(a,i) = 1;
                    E(a,j) = -1;
                    a = a + 1;
                end
            end
        end
    end

As mentioned previously, the function isdesignvalid tests whether a generated design is effectively feasible (see section 3.3). The function returns the binary number valid, which equals 1 if the design is valid and 0 otherwise. The function iterates over all columns and checks whether any of the columns is all zeros. Whenever such a column is found, the column iteration is broken with valid set to 0. If the iteration is completed without finding such a column, the function returns the initial value 1.

The implemented validity criterion is the inclusion of all samples in the experiment. The validity check could be extended to check the connectedness of the graph, or even to eliminate the isomorphic graphs (redundant experiments). However, these additional checks might not be justified, since a computation of the ESECE value could conceivably be performed faster than graph operations.
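The connectedness extension mentioned above is not part of the Matlab program; the following Python sketch shows one way it could look, using a union-find structure over the samples. The name `is_connected` and the example designs are hypothetical, and each design row is assumed to contain exactly one 1 and one -1 as required by the first feasibility constraint.

```python
def is_connected(design):
    """Return True if every pair of samples is linked by a path of arrays.

    design: list of rows, one per array, each with one 1, one -1, and 0s elsewhere.
    """
    nsample = len(design[0])
    parent = list(range(nsample))

    def find(x):
        # follow parent links to the representative, compressing the path
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for row in design:
        used = [j for j, v in enumerate(row) if v != 0]
        if len(used) == 2:
            parent[find(used[0])] = find(used[1])   # the array links its two samples

    return len({find(j) for j in range(nsample)}) == 1


# Every sample appears on some array, but {A, C} and {B, D} are never linked,
# so this design passes the all-zero-column test yet is not connected.
disconnected = [[1, 0, -1, 0],
                [0, 1, 0, -1]]
ring = [[1, -1, 0],
        [0, 1, -1],
        [-1, 0, 1]]
print(is_connected(disconnected), is_connected(ring))
```

Whether such a check pays off depends on its cost relative to the ESECE evaluation, which already rejects disconnected designs through the singular information matrix.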

    function valid = isdesignvalid(nsample, narray, A, E)
        % A design is valid if no sample column is entirely zero.
        valid = 1;
        for c = 1:nsample
            zcol = 1;
            for r = 1:narray
                if E(A(r),c) ~= 0
                    zcol = 0;
                    break
                end
            end
            if zcol
                valid = 0;
                break
            end
        end
    end

Function computeesece outputs the average ESECE value given a microarray design composed of n samples and m arrays. First, $B_2$ is initialized to the $(n-1)\times(n-1)$ identity matrix and the standard deviation $\sigma$ is set to 1. Then, the average ESECE is set to 0, and the matrices $X_1$ and $X_2$ are set to $m\times(n-1)$ zero matrices. Given a design matrix X, the matrices $X_1$ and $X_2$ formed from X are used for the ESECE calculation. Therefore, the computeesece function begins with the formation of these matrices. To calculate the average ESECE value, the ESECE value should be calculated for every direct or indirect comparison of all the samples. Equation 15 gives the total ESECE for one sample's comparison with all the remaining samples, since each diagonal element of the resulting matrix corresponds to the bias plus the variance of the compared samples.

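    % inputs X: design, n=nsample, m=narray
    function ESECEavg = computeesece(X, n, m)
        B2 = eye(n-1);        % prior covariance of the omitted coefficients
        sigma = 1;            % standard deviation of the random errors
        ESECEavg = 0;
        X1 = zeros(m,n-1);
        X2 = zeros(m,n-1);
        for k = 1:n           % treat each sample k in turn as the compared sample
            a = X(:,k);
            % X1: design matrix without column k
            for p = 1:(k-1)
                X1(:,p) = X(:,p);
            end
            for r = k:(n-1)
                X1(:,r) = X(:,r+1);
            end
            % X2: 1 where sample k shares slide i with the sample in column j
            for i = 1:m
                for j = 1:n-1
                    if (a(i,1)*X1(i,j)) == -1
                        X2(i,j) = 1;
                    else
                        X2(i,j) = 0;
                    end
                end
            end
            X1T = X1';
            X1TX1 = X1T * X1;
            if det(X1TX1) == 0
                ESECEavg = Inf;   % singular information matrix: infeasible design
                break
            else
                y1 = inv(X1TX1);
                y2 = X1T*X2;
                y3 = y2';
                bias = trace(y1*y2*B2*y3*y1)/(n-1);
                variance = sigma^2*trace(y1)/(n-1);   % sigma^2 as in equation (15)
                ESECE = bias + variance;
                ESECEavg = ESECEavg + ESECE;
            end
        end
        if ESECEavg < Inf
            ESECEavg = ESECEavg/n;
        end
    end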

At its heart, the calculation in computeesece can be written:

- Form the $X_1$ matrix by excluding the column corresponding to the compared sample from the total X matrix.
- Form the $X_2$ matrix by inserting a 1 for the compared interaction if the corresponding samples are compared on the particular slide, and 0 otherwise.
- Compute ESECE (bias + var) and divide by (n-1).
- Iterate for all the effectively feasible designs.

The function restartcreate produces two files, restart.m and currentbest.txt, every 5,000 design evaluations. An unexpected disruption could occur during runtime, or the execution of the code might have to be stopped after a long time. To address these possibilities, the restartcreate function makes it possible to restart the execution from the last design evaluated. Additionally, if one wants to run parts of the search on different computers, restartcreate provides the ability to run the portion of the designs between current and currentmax. bestval keeps the best ESECE value found up to a particular time, and the corresponding design is stored in the currentbest.txt file.

    function restartcreate(nsample, narray, current, currentmin, currentmax, bestval, best)
        % Save the best design so far and write a script that resumes the search.
        st = sprintf('_%ds_%da', nsample, narray);
        fname = sprintf('bestdesign%s.txt', st);
        save(fname, '-ascii', 'best');
        frstname = sprintf('restart%s.m', st);
        fid = fopen(frstname, 'wt');
        fprintf(fid, 'nsample = %d;\nnarray = %d;\ncurrent = %d;\ncurrentmin = %d;\ncurrentmax = %d;\nbestval = %f;\n', nsample, narray, current, currentmin, currentmax, bestval);
        line = sprintf('best = load(''%s'');\n', fname);
        fprintf(fid, '%s', line);   % write the line literally, not as a format string
        fprintf(fid, 'doe(nsample, narray, current, currentmin, currentmax, bestval, best);');
        fclose(fid);
    end

As examples, we applied the above functions to generate minimum expected squared errors of coefficient estimates (ESECE) optimal designs for n = 3 and 4 samples with 6 and 8 slides, respectively. The resulting designs (not shown) are simply rings with dye swaps between every node, i.e., every link in the ring contains both a forward and a backward arrow.

3.5 A Heuristic Search Method

In this section, the minimum bias constrained optimization (MBCO) heuristic search method is proposed for generating putatively ESECE optimal designs for cases in which the number of slides is even. The proposed method is motivated by the following theorem.

Theorem: A sufficient condition to minimize the expected bias under the constraints described in Section 3.3 is the inclusion of dye swaps for every connection between nodes.

Proof: The first term in the ESECE formula in equation (15) is the expected bias. The elements of the matrix product X1'X2 that correspond to dye-swapped samples are zero. Hence, the bias term in the ESECE formula equals zero for all designs that include a full complement of dye swaps. Since the expected bias cannot be negative, zero is its minimum and the theorem is proven.

The minimum bias constrained optimization (MBCO) heuristic search method is constrained by the sufficiency condition in the theorem because it considers only designs having a full complement of dye swaps. With minimum bias attained, the algorithm then attempts to minimize the average variance criterion (A-optimality), which in this case is equivalent to the ESECE. Formally, MBCO can be written:

Minimum Bias Constrained Optimization Method
Step 1: If the number of slides or arrays is 2n, identify the minimum variance A-optimal design using n slides only.
Step 2: For each slide in the solution of Step 1, add another slide representing its dye swap (arrow in the opposite direction).
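Step 2 of the method above can be illustrated with a short Python sketch. It assumes, purely for illustration, that a design is stored as a list of (red sample, green sample) pairs, one per slide; the function names add_dye_swaps and has_full_dye_swaps are hypothetical, and the A-optimal base design from Step 1 is taken as given rather than computed.

```python
from collections import Counter


def add_dye_swaps(base_design):
    """Return the base slides followed by the dye swap of each slide.

    Each slide is a (red_sample, green_sample) pair.
    """
    return list(base_design) + [(green, red) for red, green in base_design]


def has_full_dye_swaps(design):
    """Check that every slide orientation occurs as often as its reverse."""
    counts = Counter(design)
    return all(counts[(g, r)] == n for (r, g), n in counts.items())


# Assumed Step 1 result for 3 samples: a single loop 1 -> 2 -> 3 -> 1.
loop = [(1, 2), (2, 3), (3, 1)]
full = add_dye_swaps(loop)  # 6 slides: the loop plus the reversed loop
```

For the three-sample loop, the doubled design is a ring in which every link carries both a forward and a backward arrow, which is the form reported for the enumerated designs in Section 3.4.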

It can be checked that this method generates the same designs as the enumeration method described previously for the n = 3 and n = 4 sample cases in Section 3.4. Computationally, the benefit of MBCO is clear: the 2n-dimensional search space is reduced to an n-dimensional search space. This makes the enumeration methods described previously viable for larger problems. Also, minimum variance designs such as interwoven loop designs can be used to generate larger designs that are of interest.

Admittedly, the current study is preliminary in that many cases have not been thoroughly explored. However, so far no designs have been identified with better ESECE criterion values than the designs generated using the MBCO method. For example, consider the so-called interwoven loop designs, which are putatively A-optimal (see Kerr et al., 2001), and the designs from Shu and Wang (2005) in Figure 22. We tentatively suggest that the designs having double the number of slides (2m) through the inclusion of all dye swaps are putatively ESECE optimal.

[Figure 22: Optimal microarray designs from Shu and Wang. Panel (a): n = 6, m = 8 and n = 10, m = 12. Panel (b): n = 5, m = 10 and n = 10, m = 20.]
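To give a rough sense of the search-space reduction claimed for MBCO earlier in this section, the Python sketch below counts slide assignments under a deliberately crude assumption: every slide may hold any ordered (red, green) pair of distinct samples, and no symmetry or feasibility reduction is applied. The function name raw_design_count is hypothetical, and the counts are raw upper bounds rather than the numbers of effectively feasible designs enumerated in this chapter.

```python
def raw_design_count(nsample, nslide):
    """Raw number of slide sequences: one ordered pair of distinct samples per slide."""
    pairs_per_slide = nsample * (nsample - 1)
    return pairs_per_slide ** nslide


# Four samples: searching all 8 slides directly versus only the 4 base slides.
direct = raw_design_count(4, 8)   # 12**8
reduced = raw_design_count(4, 4)  # 12**4
print(direct, reduced, direct // reduced)
```

Under this assumption, halving the number of free slides takes the square root of the raw count, which is why the gain grows quickly with design size.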
