Systematic Variation in Genetic Microarray Data


Biostatistics (2004), 1, 1, pp. 1-47. Printed in Great Britain.

By KIMBERLY F. SELLERS, JEFFREY MIECZNIKOWSKI, and WILLIAM F. EDDY
Department of Statistics, Carnegie Mellon University, Pittsburgh, PA
ksellers@stat.cmu.edu

Summary

The main focus in cDNA microarray analysis is determining which genes are differentially expressed. Scientists apply known statistical methods to model the structure of the experiment or develop new approaches for assessing statistical significance, and assume that the data consist of the signal plus random noise. Here, we report the results of some exploratory analyses of such data that show the existence of sources of significant systematic variation which are not necessarily accounted for in standard analyses. In particular, we construct a linearization procedure and compare its effectiveness with that of Yang et al. (2001). Furthermore, we consider not only the variation due to the pin/print-tip as in previous work, but also the row and column location on the microarray chip, and the relative location from the well-plate. Removal of this extra variation can affect both the size of differential gene expression and which genes are inferred to be differentially expressed.

Some key words: Systematic variation, Analysis of Variance, Microarray analysis, Differential gene expression, Gene detection.

1 Introduction

The human genome consists of DNA sequences located within the nucleus of each cell. Specific DNA sequences are copied (transcribed) into messenger RNA (mRNA). These mRNA copies then move from the nucleus to the cytoplasm of the cell so that the corresponding sequence can be used to manufacture various protein molecules. Genetic microarray technology makes use of this process (Brown and Botstein, 1999; Duggan et al., 1999; Lockhart and Winzeler, 2000). Target complementary DNA (cDNA) elements are laid out on a glass slide and are probed with dye-labeled samples. The target cDNA elements are generated in advance using the polymerase chain reaction and are physically arrayed in a two-dimensional grid on a chemically modified glass slide. Then, in the two-dye method we are considering here, equal amounts of two purified mRNA samples are separately reverse transcribed using primer sets labeled with two different fluorescent dyes. The two resulting dye-labeled samples are used as probes in a competitive hybridization reaction with the target elements on the chip. After hybridization, a scanner generates two images of the chip at the wavelengths of light corresponding to the two dyes. The images are processed to generate a single number corresponding to each sample for each spot on the chip. Dudoit et al. (2000) and Yang et al. (2002), for example, compare processing and analysis methods used on the images that generate the microarray data. The raw data in this work are the values resulting from such image processing.

Genetic microarray data of this type have become an important object of statistical research; see, e.g., Kerr et al. (2000) and Efron et al. (2001). These analyses focus on deciding which genes are differentially expressed. They apply known statistical methods to model the structure of the experiment or develop new approaches for assessing statistical significance, and assume that the data consist of the signal plus random noise. Here we ask the question: is this assumption correct, or are the data in fact contaminated with systematic variation (which may reduce the signal-to-noise ratio to the point of obscuring significant differences)?

Section 2 provides brief background on the cDNA microarray procedure and describes the data that result from this procedure applied to a chip. Specifically, we explain the format by which the arrayer moves the cDNA spots from the well plate to the microarray chip. This section sets the stage, illustrating some of the different sources of systematic variation that exist in the data. Section 3 introduces a proposed linearization scheme to balance the fluorescence intensities of the Cy3 and Cy5 dyes within the chip. Further, this method is compared to the normalization approaches of Yang et al. (2001) and to a different, more robust, linearization scheme which uses Huber estimation. After the data are normalized, initial models are considered to measure the systematic variation through (1) chip row and chip column location, (2) plate row, plate column, and plate ordering, and (3) arrayer pin number and time ordering. Using the resulting information, we propose a forward stepwise ANOVA model, taking into account the relative standing of each factor in the overall model and its number of degrees of freedom. We take such an approach toward the data because there exist interdependencies between the factors mentioned above with respect to their location. This model is then compared to a nested model that considers chip row and column locations within the quadrant corresponding to the pin number. Section 4 demonstrates the effect on gene selection before and after the systematic variation is removed. Finally, Section 5 provides discussion, summarizing our procedure and comparing our results to the lowess procedure suggested by Yang et al. (2001). This paper highlights results for one particular chip; however, other chips were analyzed as well. The interested reader is referred to supplementary material for analogous results.

2 Data

The cDNA microarray procedure produces an image of the expression levels of a number of genes arranged on a microscope slide (Cheung et al., 1999; Yue et al., 2001). An image analysis algorithm is then applied to measure the intensity for each cDNA spot on the chip (Dudoit, 2000; Yang et al., 2002). Our raw data result from such an algorithm (GEMTools 24; Copyright Incyte Genomics, Inc.). The ratio between the two intensities detected for any given spot is an estimate of the relative amount of the mRNA corresponding to that element present in the original two samples.

The arrayer that generated the chips producing our data used four pins to transport the samples from 96-well plates to the microscope slide. Each of the chips used in this experiment was created from 106 plates, where the first and last plates served as controls, and the remaining 104 plates contained the genes for which the differential expression information is desired.

2.1 The Arrayer Procedure

For our data, the four arrayer pins are arranged in a 2 × 2 matrix structure, approximately 125 mm on center, so that each pin fills one quadrant of the chip. We refer to Pins 1, 2, 3, and 4 as those located in the upper-left, upper-right, bottom-left, and bottom-right corners, respectively. The spots are approximately 150 µm in diameter, with centers 170 µm apart to ensure no overlap between spots (Yue et al., 2001). The spots are laid out in a 102 × 100 array. More specifically, each of the four quadrants within the array (corresponding to pin number) has dimension 51 × 50; thus there are 2550 spots per quadrant (pin). The chip's spot locations are labeled consecutively row-wise within each pin, first numbering within Pin 1 (1-2550), followed by the spots within Pin 2 (2551-5100), etc. Thus, the spot location values range from 1 to 10200.

2.2 Intensities

The data we analyzed are contained in a spreadsheet and give the intensity readings from the Cy3- and Cy5-labeled probes for each spot, as produced by the image processing software. We let $P_1$ denote the intensity of the signal from Probe 1, which was Cy3-labeled, and $P_2$ denote the intensity of the Probe 2 signal, which was Cy5-labeled, for a specific spot. For each spot $i$, we let

$$\log\left(\frac{P_{1i}}{P_{2i}}\right) = \log(P_{1i}) - \log(P_{2i}) \qquad (1)$$

denote the differential log expression between the two probes for that spot.

We should note at this point that there are 96/4 = 24 spots per quadrant per plate. Since 2550 = (24 × 106) + 6, there can be 106 plates per chip with a remainder of six spots in each quadrant. Put another way, this procedure produces the expression levels of 106 × 96 = 10176 spots per chip, arranged in a two-dimensional array on the slide that accommodates up to 10200 spots. The remaining 24 = 6 × 4 spot locations (six in each quadrant) remain unused and are, therefore, not considered within the analysis (see Figure 1a).

Of the remaining 10176 spots on the microarray chip, 192 spots were transported from the control plates. Specifically, the first and last plates used to create the chip served as controls, i.e., plates containing known differentially expressed genes. Each plate contains 96 wells (96 × 2 = 192). As mentioned in Yang et al. (2002), such genes tend to be highly expressed and, hence, may not be representative of other genes of interest. Such is the case here: the spread of the control data is significantly larger than that of the experimental data (see Figure 2). Hence the controls did not supply us with a reasonable measure of comparison for our analysis, and they are removed; see Figure 1b for a visual representation of those locations not considered. Finally, our dataset contains 2769 locations where the reported value for both samples exactly equals zero; we therefore excluded these locations from the analysis. This results in a total of 24 + 192 + 2769 = 2985 locations that are not considered, while the remaining 10200 - 2985 = 7215 spots comprise the data used for this work; see Figure 1c.

Fig. 1. Chip representation: black denotes location of (a) unused spaces, (b) unused spaces and spots from control genes, (c) all spots removed from analysis (unused, control, and missing).

Fig. 2. Boxplot comparison of control versus experimental data. The data values shown represent the difference of the logged intensities, $\log(P_1) - \log(P_2)$, in Equation (1).

Thus, we consider $\log(P_{1i}/P_{2i}) = \log(P_{1i}) - \log(P_{2i})$, $i = 1, \ldots, 7215$, the differential log expression between the two probes for each spot $i$. An alternative approach is to consider, for each $i$, the transformation to mean log intensity, $A_i = \log_2 \sqrt{P_{1i} P_{2i}}$, and log intensity ratio, $M_i = \log_2(P_{1i}/P_{2i})$. This transformation is a 45-degree clockwise rotation of the log intensity versus log intensity plot.

There are three obvious sources of possible systematic variation which are a consequence of the experimental procedure and do not contribute to differential gene expression. The first is the physical layout on the glass slide; one can imagine that there are spatial effects across the slide (caused, for example, by the way the dye-labeled material is applied to the slide) which would manifest as a pattern of row and/or column effects if the data were analyzed as a 102 × 100 array. We can already visually detect the existence of some effect from the spatial representation of the differential log expression (see Figure 3a), even though the scaling is heavily influenced by the presence of a few outliers. These effects become more easily detectable when we consider the chip's spatial representation for the ranks of the differential log expression, displayed in Figure 3b. In this figure, there appears to be a strong spatial effect in relation to chip row and/or chip column. We will explore this variation further in the next section.

Fig. 3. $\log(P_1) - \log(P_2)$: (a) true chip representation, (b) ranked chip representation.

The second obvious source of systematic variation stems from the 96-well plates which are the source of the spots on the glass slide; one can imagine that there are effects localized to one (or more) specific plates, which would appear as localized effects on the glass slide. Note that the localization is complicated because of the arrayer procedure described above; recall the complex numbering scheme.

The third source is due to the pins themselves. One can easily imagine that the pins vary in size or some other property that causes the observations to vary from quadrant to quadrant on the chip. Equally, one can imagine a serial (in time) correlation among the observations caused by, for example, the pins not being properly cleaned between successive dips into the wells on the plates.

This is not intended to be an exhaustive list of possible sources of systematic variation, but simply a short list of obvious possibilities which we will explore further. Note that we implicitly assume a random spatial distribution of the genes on the microarray chip.
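The spot numbering of Section 2.1 and the mean log intensity and log ratio transformation described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the names `spot_to_chip` and `ma_transform` are hypothetical, and the sketch assumes that each quadrant holds 51 rows of 50 columns and that intensities are strictly positive.

```python
import math

ROWS_PER_PIN, COLS_PER_PIN = 51, 50          # assumed quadrant shape (rows x columns)
SPOTS_PER_PIN = ROWS_PER_PIN * COLS_PER_PIN  # 2550 spots per quadrant


def spot_to_chip(spot):
    """Map a spot location (1..10200) to (pin, chip_row, chip_col), all 1-based.

    Spots are numbered row-wise within each pin; Pins 1-4 are the upper-left,
    upper-right, bottom-left, and bottom-right quadrants.
    """
    if not 1 <= spot <= 4 * SPOTS_PER_PIN:
        raise ValueError("spot location out of range")
    pin_index, within = divmod(spot - 1, SPOTS_PER_PIN)
    pin = pin_index + 1
    row, col = divmod(within, COLS_PER_PIN)
    if pin in (3, 4):      # bottom quadrants sit below the top ones
        row += ROWS_PER_PIN
    if pin in (2, 4):      # right quadrants sit beside the left ones
        col += COLS_PER_PIN
    return pin, row + 1, col + 1


def ma_transform(p1, p2):
    """Return (A, M): mean log2 intensity and log2 intensity ratio."""
    log1, log2_ = math.log2(p1), math.log2(p2)
    return 0.5 * (log1 + log2_), log1 - log2_


# Spot accounting from Section 2.2.
assert 24 * 106 + 6 == SPOTS_PER_PIN
assert 4 * SPOTS_PER_PIN - (24 + 192 + 2769) == 7215
```

Under these assumptions, chip row and chip column become explicit factors for every spot, which is the form needed by the models in Section 3.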

3 Models

3.1 Linearization

Linearization of the microarray data serves to balance the fluorescence intensities of the two dyes within a chip. Various approaches have been suggested to compare relative intensities of spots under the two fluorescences; see Dudoit et al. (2000) and Yang et al. (2001). As expressed in Yue et al. (2001), an ideal hybridization is one where the scatter plot of $\log(P_1)$ versus $\log(P_2)$ should show a signal distribution along a line with a slope of 1; equivalently, we want a slope of 0 in the A versus M plot (Yang et al., 2001, 2002b). Thus, a first step to normalizing the data is to linearize the relationship between $\log(P_1)$ and $\log(P_2)$. In order to appropriately compare expression levels, we must first account for and normalize for probe intensity readings that follow a trend not on the line $\log(P_1) = \log(P_2)$.

There are a variety of papers addressing the issue of normalization (e.g., Finkelstein et al., 2001; Rocke and Durbin, 2001; Colantuoni et al., 2002; Smyth and Speed, 2003) and variance stabilization (e.g., Huber et al., 2002; Durbin et al., 2002). In particular, Yang et al. (2001, 2002b) discuss several approaches for handling within-slide normalization, namely (a) a global normalization by the median or mean of the log-intensity ratios for a particular gene set, (b) an intensity-dependent normalization, and (c) a within-print-tip-group normalization via a lowess (Cleveland 1979, 1981) smoother. In their analyses, the within-print-tip-group normalization produced the best results with regard to within-slide normalization; thus this procedure is the focus for our comparisons. To be specific, their within-print-tip-group normalization applies a lowess smoother (f = 20-40%) to the $M = \log(P_1/P_2)$ values within each quadrant corresponding to the associated pin or print-tip. This is a robust nonlinear scatterplot smoother, thus we cannot determine the exact degrees of freedom used in the process (Buja et al., 1989).

We propose, instead, the following normalization scheme. To linearize the relationship, we want to find values for $a, b, c \in \mathbb{R}$ such that

$$\sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} \left[ \log\left( \frac{P_{1i} - a}{P_{2i} - b} \right) + c \right]^2 \qquad (2)$$

is minimized. In particular, we constrain $a$ and $b$ such that $0 \le a < \min(P_{1i})$ and $0 \le b < \min(P_{2i})$ for all $i$. There exists a unique solution, although it does not have a closed form; therefore, we solve this nonlinear minimization problem numerically. For this dataset, we have $\hat{a}$ = 10428, $\hat{b}$ = 0, and $\hat{c}$ = 0.435.

For another comparison, we consider the Huber (1964) estimation, which minimizes

$$\sum_{i=1}^{n} \rho(r_i), \quad \text{where} \quad \rho(u_i) = \begin{cases} u_i^2 & \text{if } |u_i| \le d \\ d\,(2|u_i| - d) & \text{if } |u_i| > d \end{cases} \qquad (3)$$

with tuning parameter $d = 1.345$, because it is 95% as efficient as Equation (2) when applied to normal residuals and is less affected by outliers (Hamilton, 1992). These estimation procedures made little difference in our resulting estimates for $a$, $b$, $c$ (see Table 1); therefore we use the estimates derived from Equation (2).

In addition, we studied the effect of performing the linearization procedure within each quadrant. More specifically, for $j = 1, \ldots, 4$, we want to find $a_j$, $b_j$, $c_j$ such that

$$\sum_{i=1}^{n_j} r_{ij}^2 = \sum_{i=1}^{n_j} \left[ \log\left( \frac{P_{1i,j} - a_j}{P_{2i,j} - b_j} \right) + c_j \right]^2 \qquad (4)$$

is minimized. The results for $\hat{a}$, $\hat{b}$, $\hat{c}$ are provided in Table 1. As demonstrated by the estimates in Table 1 and in the left-hand plots in Figure 4, the data from Pin 3 appear different from the data from Pins 1, 2, and 4. After linearizing the data within each pin, however, we have accounted for this difference (each color denotes a particular quadrant in the microarray chip); for comparison, we also provide the M versus A plot in Figure 4(b) (Yang et al., 2001). Comparing the three linearization results, there appears to be little difference between the estimates for $a$, $b$ and $c$. Thus, to limit the use of degrees of freedom, we maintain our initial transformation procedure over the entire chip. We now define $L_1 = \log(P_1 - \hat{a}) + \hat{c}$ and $L_2 = \log(P_2 - \hat{b})$.

Fig. 4. Microarray example before and after transformation process, by pin (Pins 1-4): (a) log intensity versus log intensity plot (original and normalized values for $\log(P_1)$ against $\log(P_2)$), (b) mean log intensity (A) versus log intensity ratio (M) plot (original and normalized values). See results from Equation (4) in Table 1 for $\hat{a}$, $\hat{b}$, $\hat{c}$ values.

Table 1. Linearization results for $\hat{a}$, $\hat{b}$, $\hat{c}$ using Equation (2) versus Equations (3) and (4), respectively. Columns: Equation, $\hat{a}$, $\hat{b}$, $\hat{c}$. Rows: (2), (3), (4).

3.2 Initial models

Our second step in analyzing the relationship between spots on the microarray chip is to consider ANOVA models corresponding to each of the three possible sources of variation described in Section 2.2. We consider models relating the differential log expression with each of the following factor combinations: chip row and chip column locations; plate number, and plate row and column locations; and pin number and time order, respectively. The effects from all of these factors in each of the three initial models were demonstrated to be significant; however, we think that the effect of time order was small (F 14, with 3 and 2235 degrees of freedom) compared to its very large number of degrees of freedom. Further, we do not want to risk overfitting our data by including such a large number of degrees of freedom; thus the time order factor is no longer considered in the remainder of the paper.

Of interest now is the effect of each of the remaining factors on the data as a complete model. In order to determine the relative effect, we must proceed with caution. This is due to the collinearity that exists between the factors; more detail is given in supplementary material at www.biostatistics.oupjournals.org.

3.3 Building a complete model

Due to the interdependencies between the factors with regard to their location, we cannot blindly build a model that includes all of the factors. Therefore, we build a forward stepwise ANOVA model, taking into account the relative standing of each factor in the overall model and its number of degrees of freedom. In particular, we first consider the effect of the plate row location on the linearized differential log expression values. Then, we use the residuals from the plate row model to measure the effect due to plate column. The process continues as we consider the effect due to chip row, chip column, and plate number, respectively. From these intermediate results, we can build a more representative ANOVA table for all of the factors within one model, determining the degrees of freedom for the residuals by taking the corrected total degrees of freedom minus the sum of the degrees of freedom for the factors in the model. For this example, we obtain the ANOVA table displayed in Table 2.

Table 2. Approximate ANOVA table representing effect of all factors on log-differential expression. These terms are added sequentially (first to last). Columns: Df, Sum of Sq, Mean Sq, F Value, Pr(F). Rows: Pinwise Linearization, Plate Row, Plate Column, Chip Row, Chip Column, Plate Number, Residuals, Grand Total.

Although this again demonstrates the significance of each of the factors in their effect on differential expression levels, this table is still not exact because the degrees of freedom under such a formulation are not the true degrees of freedom for this model (this is due to the interdependencies of the factors). However, given the procedure by which the model was established, the degrees of freedom listed in Table 2 represent upper bounds on the true degrees of freedom. As a result, the F-statistics will only increase, thus demonstrating an even greater significance of the factor effects.

We can gain further insight from the information provided in Table 2; e.g., we see that the dominating factor in the above model is the linearization. Meanwhile, the plate row, plate column, chip row, chip column, and plate number factor mean squares are all much smaller, yet they still demonstrate great significance in the model.

Finally, we check for any relationship among the estimated coefficients within the respective factors. To aid us in this inquiry, we smooth the values to help identify any patterns that may exist among the coefficients. There appears to be a spatial pattern within chip column; a negative trend appears over the last 50 column values (this corresponds to Pins 2 and 4). There are also trends in the plate row and plate column coefficients; however, given the small number of coefficients for both of these factors, their significance is questionable. The smoothing suggests that a slight trend exists among the chip row coefficients. Upon inspecting the coefficients directly, however, this trend appears to be additional influence existing within Pins 3 and 4; there is a negative trend over the last 51 chip rows. Finally, there does not appear to be any spatial trend among the coefficients for plate number.

Fig. 5. Plot of coefficient values for each of the factor models referenced in Table 2. From (top) left to right: Plate Row, Plate Column, Chip Row, Chip Column, Plate Number.

To assess the validity of this model in a way that is independent of the data, we randomly sample 10 percent of the data and remove it from consideration (hereafter referred to as the test data), and build a model on the remaining 90 percent of the data (called the training data). The resulting linearization and analysis of variance results are approximately equal to those shown above with all of the data used in the analysis. Further, when using the coefficients determined from the model using the training data, the resulting residual mean-squared error for the test data is 0.0307, approximately 10 percent greater than that for the training data. Thus, we see that the model's effectiveness is not data-dependent.
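The linearization criterion of Equation (2) and the Huber function of Equation (3) in Section 3.1 can be sketched as follows. The paper states only that the minimization is solved numerically, so the coarse grid search below is an assumption made for illustration, as are the names `fit_linearization`, `huber_rho`, and the `steps` parameter. The sketch uses the fact that, for fixed $a$ and $b$, the least-squares optimum for $c$ is minus the mean of the log ratios.

```python
import math


def huber_rho(u, d=1.345):
    """Huber function of Equation (3): quadratic near zero, linear in the tails."""
    au = abs(u)
    return u * u if au <= d else d * (2.0 * au - d)


def residuals(p1, p2, a, b, c):
    """r_i = log((P1i - a) / (P2i - b)) + c, as in Equation (2)."""
    return [math.log((x - a) / (y - b)) + c for x, y in zip(p1, p2)]


def fit_linearization(p1, p2, steps=50):
    """Coarse grid search (an assumed solver) for a, b, c minimizing Equation (2).

    The grid respects 0 <= a < min(p1) and 0 <= b < min(p2).
    Returns (a, b, c, sum_of_squares).
    """
    a_max, b_max = min(p1), min(p2)
    best = None
    for i in range(steps):
        a = a_max * i / steps
        for j in range(steps):
            b = b_max * j / steps
            logs = [math.log((x - a) / (y - b)) for x, y in zip(p1, p2)]
            c = -sum(logs) / len(logs)   # least-squares optimum for c given a, b
            sse = sum((v + c) ** 2 for v in logs)
            if best is None or sse < best[3]:
                best = (a, b, c, sse)
    return best


def huber_objective(p1, p2, a, b, c, d=1.345):
    """Sum of Huber losses over the residuals, for comparison with Equation (2)."""
    return sum(huber_rho(r, d) for r in residuals(p1, p2, a, b, c))
```

A grid search is slow for several thousand spots and only as accurate as its step size; a general-purpose optimizer started from the grid optimum would be a natural refinement. Running `fit_linearization` separately on the spots of each pin gives the within-quadrant variant of Equation (4).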

Table 3. Approximate ANOVA table representing nested effect of chip row and chip column within pin, and all other factors on log-differential expression. These terms are added sequentially (first to last). Columns: Df, Sum of Sq, Mean Sq, F Value, Pr(F). Rows: Pinwise Linearization, Chip Row in Pin, Chip Column in Pin, Plate Row, Plate Column, Plate Number, Residuals, Grand Total.

3.4 Building a nested model

One can argue that the pin or print-tip effect is not as obvious as the chip row and chip column effect, given that the row and column effects are just as striking within each quadrant (corresponding to the associated pin) as they are over the entire chip. Thus, we consider a model where we nest the factors of chip row and chip column within the pin factor. We find that all of the factors are again highly significant. In particular, we notice an increase in the sum of squares value for the nested chip row and chip column factors within pin. However, their increased degrees of freedom result in a slight decrease in mean square for chip row and chip column after considering these factors nested within pin. Nonetheless, the overall effect of nesting does not change the relative significance of any of these factors, with only a slight decrease in the mean-squared error in Table 3 compared to Table 2. Thus, we maintain consideration of the full model shown in Table 2.

3.5 Model comparison

We consider how our linearization and analysis results for the general model compare to those achieved when using the lowess linearization proposed by Yang et al. (2001). To estimate the degrees of freedom corresponding to the lowess linearization (f = 0.4), we instead use loess(span=0.4, degree=1, family="symmetric") in S-plus and obtain the degrees of freedom estimate via the summary() feature in S-plus. The algorithm used to determine the loess degrees of freedom is provided in Cleveland and Devlin (1988). The results using the loess linearization are contained in Table 4. Because of the large sample size in the data, the fitted values obtained using loess do not agree exactly with the lowess fitted values. The difference in values, however, does not appear to be significant. We note that the mean squared error value for our linearization differs from the loess results by approximately 20%, and the apparent number of degrees of freedom needed for the loess procedure is not greatly different from that for our proposed model; this is a strong argument in favor of the loess model. Personally, we prefer the simpler linearization because we can easily explain it. We recognize, however, that the loess model provides a better fit without many additional degrees of freedom.

4 Gene Selection

In Section 3, we demonstrated that systematic variation exists within the microarray chip related to the linearization process, as well as the pin, plate row location, plate column location, chip row location, chip column location, and plate number factors. In this section, we examine how these affect the detection of significant genes.

Table 4. ANOVA result (general model) considering lowess linearization proposed by Yang et al. (2001). These terms are added sequentially (first to last). Columns: Df, Sum of Sq, Mean Sq, F Value, Pr(F). Rows: Pinwise Loess, Pin, Plate Row, Plate Column, Chip Row, Chip Column, Plate Number, Residuals, Grand Total.

In this work, we demonstrate the effect on gene selection of the systematic variation that was initially present. For each gene location in our sample, we plot its normalized value representing the observed difference of the logs versus its corresponding normalized corrected value (that is, the observed difference of the logs corrected for the systematic effects; we refer to these as the after-normalization values) from the complete model. Further, we first consider those genes whose normalized values fall outside of the [-3, 3] interval either before any modelling was considered or after the values were updated. This cutoff is in accordance with the work of DeRisi et al. (1996), and was chosen simply to indicate the effect of the systematic variation. Other thresholds reveal different numbers of significant genes, but the patterns are similar.

The before-versus-after-normalization plane is divided into nine areas. Points that land in the top right and bottom left subspaces represent genes that are considered significant both before and after the correction. The top and bottom center regions represent genes that are not significant before applying the model yet are significant after correction, and the center left and right subspaces contain those points that are significant before correction yet not significant after correction. Finally, the upper left and lower right subspaces represent the most significant changes possible under this construct: the genes located here are significant both before and after correction; however, the relationship between the Cy3 and Cy5 intensities has reversed. Figure 6 provides a graphic representation of this notion.

Fig. 6. Possible outcomes for significance before versus after normalization (horizontal axis: before normalization; vertical axis: after normalization):

    (S,S) with Cy3-Cy5 reclassified | (NS,S)  | (S,S)
    (S,NS)                          | (NS,NS) | (S,NS)
    (S,S)                           | (NS,S)  | (S,S) with Cy3-Cy5 reclassified

The notation (A, B) represents the significance outcome, where A denotes the significance or non-significance determined before correction, while B denotes the significance or non-significance detected after correction. NS and S denote non-significance and significance, respectively.

The ±3 threshold choice is arbitrary; Figure 7 shows how the number of significantly expressed genes changes with the choice of cutoff value before and after normalization, respectively. Figure 7a illustrates how the number of significantly expressed genes before normalization is influenced by the change in cutoff value (the after-normalization threshold remains fixed at ±3), while Figure 7b shows the effect of changing the after-normalization threshold value (before-normalization cutoff fixed at ±3). We see that the number of genes that remain statistically insignificant increases gradually as the range of consideration grows from ±2 to ±4 (see Figure 7a). The same is true with regard to changing the range of inclusion in the after-normalization axis (see Figure 7b). Further, note the location of spots relative to the thresholds. Those regions with associated step function representations in Figure 7 show that spots are far apart from each other and small in number within the subspace under consideration. Thus, the change in cutoff has little effect on the number of genes in these regions. Finally, we see from Figure 7 that, for a ±3 threshold, the number of significantly expressed genes before versus after normalization fluctuates minimally around this threshold region.

Using ±3 as the threshold value, the before-versus-after-normalization plot for these data is provided in Figure 8. There are no genes for which the intensity ratio changes its significance with respect to Cy3 and Cy5 (i.e., no points are contained in the upper-left or bottom-right regions). There are 18 genes that are significantly expressed both before and after the correction, while 39 genes that would not initially be considered significantly expressed are now significant after adjusting the output using the model described in Section 3.3, and 31 genes are no longer significantly expressed after the model is considered; see Table 5. However, there does not appear to be any interesting relationship with respect to relative chip location or associated gene name. We tested the effect of our normalization on the control data and found that the control data fell beyond the thresholds both before and after normalization. Nonetheless, we do not mean to imply that those genes exceeding this threshold are differentially expressed.
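The nine-region classification described above can be sketched as a small function. This is an illustrative sketch only: the names `classify` and `contingency` are hypothetical, and the inputs are assumed to be already-normalized before and after values, with a value counted as significant when it falls strictly outside the [-cutoff, cutoff] interval.

```python
from collections import Counter


def classify(before, after, cut_before=3.0, cut_after=3.0):
    """Label one gene by its significance before and after correction."""
    def side(value, cut):
        # -1 or +1 when outside [-cut, cut], 0 when inside the interval.
        if value < -cut:
            return -1
        if value > cut:
            return 1
        return 0

    sb, sa = side(before, cut_before), side(after, cut_after)
    if sb == 0 and sa == 0:
        return "(NS,NS)"
    if sb == 0:
        return "(NS,S)"
    if sa == 0:
        return "(S,NS)"
    # Significant both times: same sign, or Cy3-Cy5 relationship reversed.
    return "(S,S)" if sb == sa else "(S,S) reclassified"


def contingency(before_values, after_values, cut_before=3.0, cut_after=3.0):
    """Count genes in each outcome, in the spirit of Table 5."""
    return Counter(
        classify(b, a, cut_before, cut_after)
        for b, a in zip(before_values, after_values)
    )
```

For example, `contingency([4.0, 0.5, -3.5], [3.2, 3.4, 1.0])` counts one gene in each of `(S,S)`, `(NS,S)`, and `(S,NS)`.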

Fig. 7. Plots demonstrating the effect of a changing cutoff value. (a) Before-normalization cutoff changes, while after-normalization cutoff remains fixed. (b) After-normalization cutoff changes, while before-normalization cutoff remains fixed.
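A cutoff sweep of the kind summarized in Figure 7 can be sketched as follows. The function name `sweep_cutoff` and its parameters are hypothetical, the inputs are assumed to be normalized values, and for brevity the sketch does not separate the sign-reversed (S,S) cases from the ordinary ones.

```python
def sweep_cutoff(before, after, cutoffs, fixed_cut=3.0, vary="before"):
    """Count significance outcomes while one cutoff varies and the other is fixed.

    vary="before" changes the before-normalization cutoff (as in panel a);
    vary="after" changes the after-normalization cutoff (as in panel b).
    Returns a list of (cutoff, counts) pairs.
    """
    if vary not in ("before", "after"):
        raise ValueError("vary must be 'before' or 'after'")
    results = []
    for cut in cutoffs:
        cut_b = cut if vary == "before" else fixed_cut
        cut_a = cut if vary == "after" else fixed_cut
        counts = {"(S,S)": 0, "(S,NS)": 0, "(NS,S)": 0, "(NS,NS)": 0}
        for b, a in zip(before, after):
            label_b = "S" if abs(b) > cut_b else "NS"
            label_a = "S" if abs(a) > cut_a else "NS"
            counts["(%s,%s)" % (label_b, label_a)] += 1
        results.append((cut, counts))
    return results
```

Plotting each count against the cutoff gives step functions, since a count changes only when the cutoff crosses an observed value.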

Fig. 8. Before normalization versus after normalization plot (axes: observed values, normalized; residual values, normalized).

Table 5. Before versus after normalization contingency table.

5 Discussion

The introduction of the cDNA microarray by Schena, et al. (1995) and the analytical procedure of Shalon, et al. (1996) are significant contributions to the field of genetics. There has been substantial statistical research recognizing the need to normalize the data in order to remove systematic variation due to the pins, with discussion noting the need for further consideration of additional factors causing systematic variation (Smyth and Speed, 2003). Through this exploratory work, we have done just that: we have demonstrated the existence of other sources of significant systematic variation that go unaccounted for in standard analyses. Specifically, there exists systematic variation due to the location on the chip, the location on the plate, and among the batch of plates. It is necessary to remove this extra variation not only because of its effect on the size of the differential gene expression but, more significantly, because of the varied outcomes with respect to gene selection.

It is necessary to have a standard approach to remove the systematic variation within the chip. First, we perform the linearization technique described in Section 3.1. This removes a great deal of the variation in the chip and balances the Cy3 and Cy5 intensities within the chip. Next, we suggest removing the variation in a step-by-step fashion as discussed in Section 3.3. Summarizing, we first remove the variation due to chip row location, and then that due to chip column. Next, we successively remove the effects due to the plate row, plate column, and plate number.

We compare our model to the lowess (f = 40%) approach of Yang, et al. (2001). Both procedures account for a significant amount of variation, with the lowess model reducing the variation more than our model; however, we feel that our approach is easily explained and understood given the context of the experiment and data collection procedure.
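The step-by-step removal summarized above can be sketched as successive subtraction of factor-level means. This is only an illustration, under the assumption that each factor's effect is estimated by the mean of the current residuals within each level; the model of Section 3.3 may estimate the effects differently. The names `remove_factor` and `sequential_adjust` and the toy data are hypothetical.

```python
import numpy as np

def remove_factor(values, labels):
    """Subtract, from every value, the mean of the values sharing its factor level."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    adjusted = values.copy()
    for level in np.unique(labels):
        mask = labels == level
        adjusted[mask] -= values[mask].mean()
    return adjusted

def sequential_adjust(values, factors):
    """Remove factor effects one at a time, in the order the factors are given."""
    residual = np.asarray(values, dtype=float)
    for labels in factors:
        residual = remove_factor(residual, labels)
    return residual

# Toy 2 x 2 layout with purely additive row and column effects (hypothetical).
chip_row = np.array([0, 0, 1, 1])
chip_col = np.array([0, 1, 0, 1])
log_ratio = np.array([1.0, 3.0, 3.0, 5.0])  # row effects (1, 3) plus column effects (0, 2)

# Order follows the text: chip row first, then chip column.
residual = sequential_adjust(log_ratio, [chip_row, chip_col])
```

In this balanced toy layout the two steps remove all of the structure, so the residuals are zero. With unbalanced real data the result generally depends on the order in which the factors are removed, which is why the order is stated explicitly in the text.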

Acknowledgements

The authors wish to thank Eleanor Fiengold for her helpful and insightful discussion on this paper and related work regarding microarray analysis, as well as the referees for their astute and insightful comments. We also thank Karoly Mirnics for providing the data. Kimberly Sellers and Jeffrey Miecznikowski were supported in part by the NSF-VIGRE program, grant number DMS.

References

Brown, P.O. and Botstein, D. (1999). Exploring the new world of the genome with DNA microarrays. Nature Genetics Supplement.
Buja, A., Hastie, T. and Tibshirani, R. (1989). Linear smoothers and additive models. Annals of Statistics.
Cheung, V., Morley, M., Aguilar, F., Massimi, A., Kucherlapati, R. and Childs, G. (1999). Making and reading microarrays. Nature Genetics Supplement.
Cleveland, W.S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association.
Cleveland, W.S. (1981). LOWESS: A program for smoothing scatterplots by robust locally weighted regression. The American Statistician.
Cleveland, W.S. and Devlin, S.J. (1988). Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association.
Colantuoni, C., Henry, G., Zeger, S. and Pevsner, J. (2002). SNOMAD (Standardization and Normalization of Microarray Data): web-accessible gene expression data analysis. Bioinformatics.
DeRisi, J., Penland, L., Brown, P.O. and Bittner, M.L. (1996). Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nature Genetics.

Duggan, D.J., Bittner, M., Chen, Y., Meltzer, P. and Trent, J.M. (1999). Expression profiling using cDNA microarrays. Nature Genetics Supplement.
Durbin, B.P., Hardin, J.S., Hawkins, D.M. and Rocke, D.M. (2002). A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 18, S105-S110.
Dudoit, S., Yang, Y.H., Callow, M.J. and Speed, T.P. (2000). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical report 578, University of California at Berkeley, 1-38.
Efron, B., Tibshirani, R., Storey, J.D. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association.
Finkelstein, D.B., Ewing, R., Gollub, J., Sterky, F., Somerville, S. and Cherry, J.M. (2001). Iterative linear regression by sector. In: Methods of Microarray Data Analysis. Papers from CAMDA 2000, eds. S.M. Lin and K.F. Johnson, Kluwer Academic.
Hamadeh, H. and Afshari, C.A. (2000). Gene chips and functional genomics. American Scientist.
Hamilton, L. (1992). Regression with Graphics: A Second Course in Applied Statistics. California: Duxbury Press.
Huber, P.J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics.
Huber, W., von Heydebreck, A., Sültmann, H., Poustka, A. and Vingron, M. (2002). Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18, S96-S104.
Kerr, M.K., Afshari, C.A., Bennett, L., Bushel, P., Martinez, J., Walker, J. and Churchill, G.A. (2001). Statistical analysis of a gene expression microarray experiment with replication. Statistica Sinica, to appear.

Kerr, M.K., Martin, M. and Churchill, G.A. (2000). Analysis of variance for gene expression microarray data. Journal of Computational Biology.
Lockhart, D.J. and Winzeler, E.A. (2000). Genomics, gene expression and DNA arrays. Nature.
Mallows, C.L. (1980). Some theory of nonlinear smoothers. Annals of Statistics.
Rocke, D.M. and Durbin, B. (2001). A model for measurement error for gene expression arrays. Journal of Computational Biology.
Schena, M., Shalon, D., Davis, R.W. and Brown, P.O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science.
Shalon, D., Smith, S.J. and Brown, P.O. (1996). A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Research.
Smyth, G.K. and Speed, T. (2003). Normalization of cDNA microarray data. METHODS: Selecting Candidate Genes from DNA Array Screens: Application to Neuroscience, D. Carter (ed.).
Yang, Y.H., Dudoit, S., Luu, P. and Speed, T.P. (2001). Normalization for cDNA microarray data. Microarrays: Optical Technologies and Informatics, M.L. Bittner, Y. Chen, A.N. Dorsel, and E.R. Dougherty (eds.), Proceedings of SPIE.
Yang, Y.H., Buckley, M., Dudoit, S. and Speed, T.P. (2002a). Comparison of methods for image analysis on cDNA microarray data. Journal of Computational and Graphical Statistics.
Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J. and Speed, T.P. (2002b). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research 30(4): e15.
Yue, H., Eastman, P.S., Wang, B., Minor, J., Doctolero, M.H., Nuttall, R.L., Stack, R., Becker, J.W., Montgomery, J.R., Vainer, M. and Johnston, R. (2001). An evaluation of the performance of cDNA microarrays for detecting changes in global mRNA expression. Nucleic Acids Research 29, e41.

A List of Tables

Table 1: Linearization results for $\hat{a}$, $\hat{b}$, $\hat{c}$ using Equation (2) versus Equations (3) and (4), respectively.
Table 2: Approximate ANOVA table representing effect of all factors on log-differential expression. These terms are added sequentially (first to last).
Table 3: Approximate ANOVA table representing nested effect of chip row and chip column within pin, and all other factors on log-differential expression. These terms are added sequentially (first to last).
Table 4: ANOVA result (general model) considering lowess linearization proposed by Yang, et al. (2001). These terms are added sequentially (first to last).
Table 5: Before versus after normalization contingency table.

B List of Figures

Figure 1: Chip representation: black denotes location of (a) unused spaces, (b) unused spaces and spots from control genes, (c) all spots removed from analysis (unused, control, and missing).
Figure 2: Boxplot comparison of control versus experimental data. The data values shown represent the difference of the logged intensities, $\log(p_1) - \log(p_2)$, in Equation (1).
Figure 3: $\log(p_1) - \log(p_2)$: (a) true chip representation, (b) ranked chip representation.
Figure 4: Microarray example before and after transformation process: (a) log intensity versus log intensity plot, (b) mean log intensity (A) versus log intensity ratio (M) plot. See Equation (4) results in Table 1 for $\hat{a}$, $\hat{b}$, $\hat{c}$ values.

Figure 5: Plot of coefficient values for each of the factor models referenced in Table 2. From (top) left to right: Plate Row, Plate Column, Chip Row, Chip Column, Plate Number.
Figure 6: Possible outcomes for significance before versus after correction. The notation (A, B) represents the significance outcome, where A denotes the significance or non-significance determined before correction, while B denotes the significance or non-significance detected after correction. NS and S denote non-significance and significance, respectively.
Figure 7: Plots demonstrating the effect of a changing cutoff value. (a) Before-normalization cutoff changes, while after-normalization cutoff remains fixed. (b) After-normalization cutoff changes, while before-normalization cutoff remains fixed.
Figure 8: Before normalization versus after normalization plot.

C Supplementary Material

The following material is intended for inclusion on the website, www.biostatistics.oupjournals.org, as supplementary material to the above paper.

D Relationships to spot location

We consider the data in relation to the following factors: pin, plate row, plate column, chip row, chip column, and plate number. It is worth noting, however, that all of these factors can be expressed by their location on the microarray chip, $i = 1, \ldots, 10200$, where the 24 unused locations occur for $i = 2545, \ldots, 2550$; $5095, \ldots, 5100$; $7645, \ldots, 7650$; and $10195, \ldots, 10200$. Because the associated spots are unused and, therefore, not considered in the analyses, we do not concern ourselves with these locations in the formulae below. To ease the computations, we

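define $i' := i \pmod{2550}$; thus the following relationships will be expressed in terms of $i$ and $i'$. The pin or quadrant value, $q_i$, is easily related to the location of the spot:

$$q_i = \left\lceil \frac{i}{2550} \right\rceil .$$

We can explain the relationship of the ordered spots to their location on a plate with respect to plate row (denoted $j$) and plate column (denoted $k$). The plate row location is described as

$$j_i = \begin{cases} 2\left\lceil \dfrac{[i \pmod{2550}] \pmod{24}}{6} \right\rceil - 1 = 2\left\lceil \dfrac{i' \pmod{24}}{6} \right\rceil - 1, & \text{for } i' \not\equiv 0 \pmod{24}, \\ 7, & \text{for } i' \equiv 0 \pmod{24}, \end{cases}$$

if $1 \le i \le 5100$, and

$$j_i = \begin{cases} 2\left\lceil \dfrac{[i \pmod{2550}] \pmod{24}}{6} \right\rceil = 2\left\lceil \dfrac{i' \pmod{24}}{6} \right\rceil, & \text{for } i' \not\equiv 0 \pmod{24}, \\ 8, & \text{for } i' \equiv 0 \pmod{24}, \end{cases}$$

if $5101 \le i$. For plate column,

$$k_i = \begin{cases} 2\{[i \pmod{2550}] \pmod{6} - 1\} + 1 = 2[i' \pmod{6} - 1] + 1, & \text{for } i' \pmod{6} \ne 0, \\ 11, & \text{for } i' \pmod{6} = 0, \end{cases}$$

if $1 \le i \le 2550$ or $5101 \le i \le 7650$, and

$$k_i = \begin{cases} 2\{[i \pmod{2550}] \pmod{6}\} = 2[i' \pmod{6}], & \text{for } i' \pmod{6} \ne 0, \\ 12, & \text{for } i' \pmod{6} = 0, \end{cases}$$

if $2551 \le i \le 5100$ or $7651 \le i$.

The chip row and chip column locations (denoted $r_i$ and $c_i$, respectively) are easily described in terms of the array location. The function,

$$r_i = \begin{cases} \left\lceil \dfrac{i \pmod{2550}}{50} \right\rceil = \left\lceil \dfrac{i'}{50} \right\rceil, & \text{for } i = 1, \ldots, 5100, \\ 51 + \left\lceil \dfrac{i \pmod{2550}}{50} \right\rceil = 51 + \left\lceil \dfrac{i'}{50} \right\rceil, & \text{for } i = 5101, \ldots, 10200, \end{cases}$$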

describes the relationship between $i$, $i'$ and chip row, while

$$c_i = \begin{cases} i \pmod{100}, & \text{for } i \not\equiv 0 \pmod{100}, \\ 100, & \text{for } i \equiv 0 \pmod{100}, \end{cases}$$

relates $i$ to chip column. Let $n_i$ denote the plate number on which the spot is located ($n_i = 1, \ldots, 106$). Thus, the plate number is also a function of the spot location in the array representation in that

$$n_i = \left\lceil \frac{i \pmod{2550}}{24} \right\rceil = \left\lceil \frac{i'}{24} \right\rceil .$$

E Results for Microarray Chip 2

We first remove from analysis the data corresponding to the location of unused spaces, controls, and missing data. In particular, the controls are removed because the spread of the control data is significantly larger than that of the experimental data, with a substantial number of outliers (see Figure 9); hence, they do not provide a reasonable measure of comparison for our analysis. From viewing the chip representations of the actual differential log expression (Figure 10a) and particularly its associated ranked version (Figure 10b), there does not appear to be any significant trend that is detectable by the eye. Nonetheless, we proceed with further analysis to confirm this hypothesis.

First, by plotting the $\log(p_1)$ versus $\log(p_2)$ and A versus M scatterplots, we witness an interesting form of clustering in that the majority of the $\log(p_1)$ values are between 4 and 5.5 while the majority of $\log(p_2)$ values cluster between 4.5 and 6; meanwhile, the remaining values are more scattered in a somewhat curvilinear trend; see Figures 11a and b. After the linearization technique described in Equation (4) is applied, this curvature is slightly diminished and the trend is thus generally linear with a slope of 1 in Figure 11a (equivalently, the trend has a slope of 0 in Figure 11b). Note that the estimates for $\hat{a}$, $\hat{b}$, $\hat{c}$ cluster between Pins 1 and 3, and Pins 2

Fig. 9. Boxplot comparison of control versus experimental data for Chip 2. The data values shown represent the difference of the logged intensities, $\log(p_1) - \log(p_2)$. Boxplot groups: controls; experimental data.

Fig. 10. $\log(p_1) - \log(p_2)$ for Chip 2: (a) true chip representation, (b) ranked chip representation.
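The spot-location relationships of Appendix D can be sketched in code. This is an illustrative reading rather than the authors' implementation: it assumes that the bracketed quotients in Appendix D are ceilings and that each plate contributes 24 spots per pin (so that 106 plates fill the 2544 used locations of a pin). The names `ceil_div` and `spot_factors` are hypothetical.

```python
SPOTS_PER_PIN = 2550  # array locations per pin, including 6 unused ones

def ceil_div(a, b):
    """Integer ceiling of a / b for non-negative a and positive b."""
    return -(-a // b)

def spot_factors(i):
    """Map an array location i (1..10200) to its factor levels.

    Assumes the ceiling reading of the Appendix D formulae; unused
    locations (the last 6 of each pin) are not meaningful inputs.
    """
    ip = i % SPOTS_PER_PIN  # i' = i (mod 2550)
    m24 = ip % 24
    m6 = ip % 6

    pin = ceil_div(i, SPOTS_PER_PIN)

    # Plate row: odd rows for the first two pins, even rows for the last two.
    if i <= 5100:
        plate_row = 2 * ceil_div(m24, 6) - 1 if m24 != 0 else 7
    else:
        plate_row = 2 * ceil_div(m24, 6) if m24 != 0 else 8

    # Plate column: odd columns for pins 1 and 3, even columns for pins 2 and 4.
    if i <= 2550 or 5101 <= i <= 7650:
        plate_col = 2 * (m6 - 1) + 1 if m6 != 0 else 11
    else:
        plate_col = 2 * m6 if m6 != 0 else 12

    chip_row = ceil_div(ip, 50) + (0 if i <= 5100 else 51)
    chip_col = i % 100 if i % 100 != 0 else 100
    plate = ceil_div(ip, 24)  # assumed: 24 spots per plate per pin

    return {
        "pin": pin,
        "plate_row": plate_row,
        "plate_col": plate_col,
        "chip_row": chip_row,
        "chip_col": chip_col,
        "plate": plate,
    }

first = spot_factors(1)
last_used = spot_factors(10194)
```

Deriving every factor from the single index `i` keeps the factor levels consistent with one another, which is convenient when building the design for the sequential analysis of variance.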

Table 6. Chip 2 linearization results for $\hat{a}$, $\hat{b}$, $\hat{c}$ using Equation (2) versus Equations (3) and (4), respectively. Columns: Equation, $\hat{a}$, $\hat{b}$, $\hat{c}$; rows: (2), (3), (4).

and 4; see Table 6. There appears to be a significant change due to the proposed linearization process for Chip 2. When the complete model is considered, only the linearization and chip row factors are significant, while the remaining factors have small mean sums of squares; see Table 7 for the approximate analysis of variance table. Thus, we see that the systematic variation within this chip is due largely to the linearization and chip row effects. As before, the nested model in Table 8 does not show any clear advantage over the general model; therefore, we maintain use of the proposed general model shown in Table 7. In comparison with the lowess model of Table 9, we find that the lowess model appears to perform better, yielding an MSE approximately 20% smaller than that for our proposed model.

Consideration of training and test data indicates that the model's results are not data-dependent. The resulting linearization and analysis of variance results are approximately equal to those shown above with all of the data used in the analysis, and the mean-square error for the test data is

Fig. 11. Microarray example before and after transformation process: (a) log intensity versus log intensity plot, (b) mean log intensity (A) versus log intensity ratio (M) plot. See results from Equation (4) in Table 1 for $\hat{a}$, $\hat{b}$, $\hat{c}$ values. Axis labels: original and normalized values for $\log(p_1)$ and $\log(p_2)$ in (a), and for A and M in (b); panels: Pins 1 to 4.

Table 7. Approximate ANOVA table representing effect of all factors on log-differential expression for Chip 2. These terms are added sequentially (first to last). Columns: Df, Sum of Sq, Mean Sq, F Value, Pr(F); rows: Pinwise Linearization, Plate Row, Plate Column, Chip Row, Chip Column, Plate Number, Residuals, Grand Total.

approximately 10 percent greater than that for the training data.

In searching for any spatial relationships within the various factors in the general model, there do not appear to be any significant trends. Although the first four plate row coefficients have an increasing trend while the last four successively decrease, it is difficult to ascertain a trend given the small number of coefficients. Similarly, the plot of chip column coefficients is interesting in that there is a slight quadratic trend over column location. Further, the majority of coefficients fall within [-0.07, 0.07], with two outlying coefficient values corresponding to the 59th and 73rd columns of the chip. Refer to Figure 12 for a spatial representation of the coefficients corresponding to pin, plate row, plate column, chip row, chip column, and plate number.

Finally, in consideration of gene selection, the before versus after normalization plot for Chip 2 is provided in Figure 13 with its associated contingency table, Table 10, providing the number
