Systematic Variation in Genetic Microarray Data


Biostatistics (2004), 1, 1, pp. 1-47. Printed in Great Britain.

By KIMBERLY F. SELLERS, JEFFREY MIECZNIKOWSKI, and WILLIAM F. EDDY
Department of Statistics, Carnegie Mellon University, Pittsburgh, PA
ksellers@stat.cmu.edu

Summary

The main focus in cDNA microarray analysis is determining which genes are differentially expressed. Scientists apply known statistical methods to model the structure of the experiment, or develop new approaches for assessing statistical significance, and assume that the data consist of the signal plus random noise. Here, we report the results of some exploratory analyses of such data that show the existence of sources of significant systematic variation which are not necessarily accounted for in standard analyses. In particular, we construct a linearization procedure and compare its effectiveness with that of Yang et al. (2001). Furthermore, we consider not only the variation due to the pin/print-tip, as in previous work, but also the row and column location on the microarray chip and the relative location from the well-plate. Removal of this extra variation can affect both the size of differential gene expression and which genes are inferred to be differentially expressed.

Some key words: Systematic variation, Analysis of Variance, Microarray analysis, Differential gene expression, Gene detection.

1 Introduction

The human genome consists of DNA sequences located within the nucleus of each cell. Specific DNA sequences are copied (transcribed) into messenger RNA (mRNA). These mRNA copies then move from the nucleus to the cytoplasm of the cell so that the corresponding sequence can be used to manufacture various protein molecules. Genetic microarray technology makes use of this process (Brown and Botstein, 1999; Duggan et al., 1999; Lockhart and Winzeler, 2000). Target complementary DNA (cDNA) elements are laid out on a glass slide and are probed with dye-labeled samples. The target cDNA elements are generated in advance using the polymerase chain reaction and are physically arrayed in a two-dimensional grid on a chemically modified glass slide. Then, in the two-dye method we are considering here, equal amounts of two purified mRNA samples are separately reverse transcribed using primer sets labeled with two different fluorescent dyes. The two resulting dye-labeled samples are used as probes in a competitive hybridization reaction with the target elements on the chip. After hybridization, a scanner generates two images of the chip at the wavelengths of light corresponding to the two dyes. The images are processed to generate a single number corresponding to each sample for each spot on the chip. Dudoit et al. (2000) and Yang et al. (2002), for example, compare processing and analysis methods used on the images that generate the microarray data. The raw data in this work are the values resulting from such image processing.

Genetic microarray data of this type have become an important object of statistical research; see, e.g., Kerr et al. (2000) and Efron et al. (2001). These analyses focus on deciding which genes are differentially expressed. They apply known statistical methods to model the structure of the experiment, or develop new approaches for assessing statistical significance, and assume that the data consist of the signal plus random noise. Here we ask the question: is this assumption

correct, or are the data in fact contaminated with systematic variation (which may reduce the signal-to-noise ratio to the point of obscuring significant differences)?

Section 2 provides brief background on the cDNA microarray procedure and describes the data that result from this procedure applied to a chip. Specifically, we explain the format by which the arrayer moves the cDNA spots from the well plate to the microarray chip. This section sets the stage, illustrating some of the different sources of systematic variation that exist in the data.

Section 3 introduces a proposed linearization scheme to balance the fluorescence intensities of the Cy3 and Cy5 dyes within the chip. Further, this method is compared to the normalization approaches of Yang et al. (2001) and to a different, more robust, linearization scheme which uses Huber estimation. After the data are normalized, initial models are considered to measure the systematic variation through (1) chip row and chip column location, (2) plate row, plate column, and plate ordering, and (3) arrayer pin number and time ordering. Using the resulting information, we propose a forward stepwise ANOVA model, taking into account the relative standing of each factor in the overall model and its number of degrees of freedom. We take such an approach toward the data because there exist interdependencies between the factors mentioned above with respect to their location. This model is then compared to a nested model that considers chip row and column locations within the quadrant corresponding to the pin number.

Section 4 demonstrates the effect on gene selection before and after the systematic variation is removed. Finally, Section 5 provides discussion, summarizing our procedure and comparing our results to the lowess procedure suggested by Yang et al. (2001). This paper highlights results for one particular chip; however, other chips were analyzed as well. The interested reader is referred to supplementary material for analogous results.

2 Data

The cDNA microarray procedure produces an image of the expression levels of a number of genes arranged on a microscope slide (Cheung et al., 1999; Yue et al., 2001). An image analysis algorithm is then applied to measure the intensity for each cDNA spot on the chip (Dudoit, 2000; Yang et al., 2002). Our raw data result from such an algorithm (GEMTools 24; Copyright Incyte Genomics, Inc). The ratio between the two intensities detected for any given spot is an estimate of the relative amount of the mRNA corresponding to that element present in the original two samples. The arrayer which generated the chips producing our data used four pins to transport the samples from 96-well plates to the microscope slide. Each of the chips used in this experiment was created from 106 plates, where the first and last plates served as controls, and the remaining 104 plates contained the genes for which the differential expression information is desired.

2.1 The Arrayer Procedure

For our data, the four arrayer pins are arranged in a 2 × 2 matrix structure, approximately 125 mm on center, and transport the samples to the slide so that each pin fills one quadrant of the chip. We refer to Pins 1, 2, 3, and 4 as those located in the upper-left, upper-right, bottom-left, and bottom-right corners, respectively. The spots are approximately 150 µm in diameter, with respective centers 170 µm apart from each other to ensure no overlap between spots (Yue et al., 2001). The spots are laid out in a 102 × 100 array. More specifically, each of the four quadrants within the array (corresponding to pin number) has dimension 51 × 50; thus there are 2550 spots per quadrant (pin). The chip's spot locations are labeled consecutively row-wise within each pin, first numbering within Pin 1 (1-2550), followed by the spots within Pin

2 (2551-5100), etc. Thus, the spot location values range from 1 to 10200.

2.2 Intensities

The data we analyzed are contained in a spreadsheet that gives the intensity readings from the Cy3- and Cy5-labeled probes for each spot, as produced by the image processing software. We let $P_1$ denote the intensity of the signal from Probe 1, which was Cy3-labeled, and $P_2$ denote the intensity of the Probe 2 signal, which was Cy5-labeled, for a specific spot. For each spot $i$, we let

\[ \log\left(\frac{P_{1i}}{P_{2i}}\right) = \log(P_{1i}) - \log(P_{2i}) \tag{1} \]

denote the differential log expression between the two probes for that spot.

We should note at this point that there are 96/4 = 24 spots per quadrant per plate. Since 2550 = (24 × 106) + 6, there can be 106 plates per chip with a remainder of six spots in each quadrant. Put another way, this procedure produces the expression levels of 106 × 96 = 10176 spots per chip, arranged in a two-dimensional array on the slide that accommodates up to 10200 spots. The remaining 24 = 6 × 4 spot locations (six in each quadrant) remain unused and are, therefore, not considered within the analysis (see Figure 1a).

Of the remaining 10176 spots on the microarray chip, 192 spots were transported from the control plates. Specifically, the first and last plates used to create the chip served as controls, i.e., plates containing known differentially expressed genes. Each plate contains 96 wells (96 × 2 = 192). As mentioned in Yang et al. (2002), such genes tend to be highly expressed and, hence, may not be representative of other genes of interest. Such is the case here: the spread of the control data is significantly larger than that of the experimental data (see Figure 2). Hence the controls did not supply us with a reasonable measure of comparison for our analysis, and they are removed; see Figure 1b for a visual representation of those locations.
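The spot numbering and the spot-count bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `spot_to_chip_position` is hypothetical, indices are taken to be 1-based, and rows and columns inside a quadrant are assumed to run in the same direction as on the full chip.

```python
# Layout constants taken from the text: each pin fills a quadrant of
# 51 rows by 50 columns, and spots are numbered row-wise within each pin.
ROWS_PER_PIN, COLS_PER_PIN = 51, 50
SPOTS_PER_PIN = ROWS_PER_PIN * COLS_PER_PIN  # 2550

def spot_to_chip_position(spot):
    """Map a spot location (1..10200) to (pin, chip_row, chip_col), 1-based.

    Hypothetical helper; the orientation inside a quadrant is an assumption.
    """
    if not 1 <= spot <= 4 * SPOTS_PER_PIN:
        raise ValueError("spot location out of range")
    pin, offset = divmod(spot - 1, SPOTS_PER_PIN)
    row_in_pin, col_in_pin = divmod(offset, COLS_PER_PIN)
    # Pins 1 and 2 fill the upper quadrants, Pins 3 and 4 the lower ones;
    # Pins 1 and 3 are on the left, Pins 2 and 4 on the right.
    chip_row = row_in_pin + 1 + ROWS_PER_PIN * (pin // 2)
    chip_col = col_in_pin + 1 + COLS_PER_PIN * (pin % 2)
    return pin + 1, chip_row, chip_col

# Spot-count bookkeeping from Section 2.2.
spots_per_pin_per_plate = 96 // 4                # 24
used_per_pin = spots_per_pin_per_plate * 106     # 2544
unused_per_pin = SPOTS_PER_PIN - used_per_pin    # 6
total_used = 4 * used_per_pin                    # 10176
```

With this mapping, chip row and chip column become ordinary factors that can be attached to each spot alongside its pin number.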

Fig. 1. Chip representation: black denotes locations not considered: (a) unused spaces, (b) unused spaces and spots from control genes, (c) all spots removed from analysis (unused, control, and missing).

Finally, our dataset contains 2769 locations where the reported value for both samples exactly equals zero; we therefore excluded these locations from the analysis. This results in a total of 24 + 192 + 2769 = 2985 locations that are not considered, while the remaining 10200 − 2985 = 7215 spots comprise the data used for this work; see Figure 1c. Thus, we consider

\[ \log\left(\frac{P_{1i}}{P_{2i}}\right) = \log(P_{1i}) - \log(P_{2i}), \quad i = 1, \ldots, 7215, \]

the differential log expression between the two probes for each spot $i$. An alternative approach is to consider, for each $i$, the transformation to mean log intensity, $A_i = \log_2 \sqrt{P_{1i} P_{2i}}$, and log intensity ratio, $M_i = \log_2(P_{1i}/P_{2i})$. This transformation is a 45-degree clockwise rotation of the log intensity versus log intensity plot.

There are three obvious sources of possible systematic variation which are a consequence of the experimental procedure and do not contribute to differential gene expression. The first is the physical layout on the glass slide; one can imagine that there are spatial effects across the slide

(caused, for example, by the way the dye-labeled material is applied to the slide), which would manifest as a pattern of row and/or column effects if the data were analyzed as a 102 × 100 array. We can already visually detect the existence of some effect from the spatial representation of the differential log expression (see Figure 3a), even though the scaling is heavily influenced by the presence of a few outliers. These effects become more easily detectable when we consider the chip's spatial representation for the ranks of the differential log expression, displayed in Figure 3b. In this figure, there appears to be a strong spatial effect in relation to chip row and/or chip column. We will explore this variation further in the next section.

Fig. 2. Boxplot comparison of control versus experimental data. The data values shown represent the difference of the logged intensities, $\log(P_1) - \log(P_2)$, in Equation (1).

The second obvious source of systematic variation stems from the 96-well plates which are the source of the spots on the glass slide; one can imagine that there are effects which are localized to one (or more) specific plates, which would appear as localized effects on the glass slide. Note that the localization is complicated because of the arrayer procedure described above; recall the

complex numbering scheme.

Fig. 3. $\log(P_1) - \log(P_2)$: (a) true chip representation, (b) ranked chip representation.

The third source is due to the pins themselves. One can easily imagine that the pins vary in size or some other property that causes the observations to vary from quadrant to quadrant on the chip. Equally, one can imagine a serial (in time) correlation among the observations caused by, for example, the pins not being properly cleaned between successive dips into the wells on the plates.

This is not intended to be an exhaustive list of possible sources of systematic variation, but simply a short list of obvious possibilities which we will explore further. Note that we implicitly assume a random spatial distribution of the genes on the microarray chip.
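As a small illustration of the quantities used in Section 2.2, the differential log expression of Equation (1) and the M and A values of the alternative transformation can be computed as follows. This is a minimal sketch; the function names are hypothetical, and the natural logarithm in `log_ratio` is an assumption, since the text does not state the base used in Equation (1).

```python
import math

def log_ratio(p1, p2):
    """Differential log expression log(P1) - log(P2), as in Equation (1).

    Natural logarithm assumed here; the base is not stated in the text.
    """
    if p1 <= 0 or p2 <= 0:
        # Spots with zero intensities are excluded from the analysis.
        raise ValueError("intensities must be positive")
    return math.log(p1) - math.log(p2)

def ma_transform(p1, p2):
    """Return (M, A): log2 intensity ratio and mean log2 intensity."""
    if p1 <= 0 or p2 <= 0:
        raise ValueError("intensities must be positive")
    m = math.log2(p1 / p2)
    a = math.log2(math.sqrt(p1 * p2))
    return m, a
```

Note that A is simply the average of the two log2 intensities, which is why plotting M against A amounts to rotating the plot of one log intensity against the other.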

3 Models

3.1 Linearization

Linearization of the microarray data serves to balance the fluorescence intensities of the two dyes within a chip. Various approaches have been suggested to compare relative intensities of spots under the two fluorescences; see Dudoit et al. (2000) and Yang et al. (2001). As expressed in Yue et al. (2001), an ideal hybridization is one where the scatter plot of $\log(P_1)$ versus $\log(P_2)$ should show a signal distribution along a line with a slope of 1; equivalently, we want a slope of 0 in the A versus M plot (Yang et al., 2001, 2002b). Thus, a first step to normalizing the data is to linearize the relationship between $\log(P_1)$ and $\log(P_2)$. In order to appropriately compare expression levels, we must first account for and normalize for probe intensity readings that follow a trend not on the line $\log(P_1) = \log(P_2)$.

There are a variety of papers addressing the issue of normalization (e.g., Finkelstein et al., 2001; Rocke and Durbin, 2001; Colantuoni et al., 2002; Smyth and Speed, 2003) and variance stabilization (e.g., Huber et al., 2002; Durbin et al., 2002). In particular, Yang et al. (2001, 2002b) discuss several approaches for handling within-slide normalization, namely (a) a global normalization by the median or mean of the log-intensity ratios for a particular gene set, (b) an intensity-dependent normalization, and (c) a within-print-tip-group normalization via a lowess (Cleveland 1979, 1981) smoother. In their analyses, the within-print-tip-group normalization produced the best results with regard to within-slide normalization; thus this procedure is the focus for our comparisons. To be specific, their within-print-tip-group normalization performs a lowess smoother (f = 20-40%) on the $M = \log(P_1/P_2)$ values within each quadrant corresponding to the associated pin or print-tip. This is a robust nonlinear scatterplot smoother; thus we cannot determine the exact degrees of freedom used in the process (Buja et al., 1989).

We propose, instead, the following normalization scheme. To linearize the relationship, we

want to find values for $a, b, c \in \mathbb{R}$ such that

\[ \sum_{i=1}^{n} r_i^2 = \sum_{i=1}^{n} \left[ \log\left( \frac{P_{1i} - a}{P_{2i} - b} \right) + c \right]^2 \tag{2} \]

is minimized. In particular, we constrain $a$ and $b$ such that $0 \le a < \min(P_{1i})$ and $0 \le b < \min(P_{2i})$ for all $i$. There exists a unique solution, although it does not have a closed form; therefore, we solve this nonlinear minimization problem numerically. For this dataset, we have $\hat{a} = 10428$, $\hat{b} = 0$, and $\hat{c} = 0.435$.

For another comparison, we consider the Huber (1964) estimation, which minimizes

\[ \sum_{i=1}^{n} \rho(r_i), \quad \text{where} \quad \rho(u_i) = \begin{cases} u_i^2 & \text{if } |u_i| \le d \\ d\,(2|u_i| - d) & \text{if } |u_i| > d \end{cases} \tag{3} \]

with tuning parameter $d = 1.345$, because it is 95% as efficient as Equation (2) when applied to normal residuals and is less affected by outliers (Hamilton, 1992). These estimation procedures made little difference in our resulting estimates for $a$, $b$, $c$ (see Table 1); therefore we use the estimates derived from Equation (2).

In addition, we studied the effect due to performing the linearization procedure within each quadrant. More specifically, for $j = 1, \ldots, 4$, we want to find $a_j$, $b_j$, $c_j$ such that

\[ \sum_{i=1}^{n_j} r_{ij}^2 = \sum_{i=1}^{n_j} \left[ \log\left( \frac{P_{1i,j} - a_j}{P_{2i,j} - b_j} \right) + c_j \right]^2 \tag{4} \]

is minimized. The results for $\hat{a}$, $\hat{b}$, $\hat{c}$ are provided in Table 1. As demonstrated by the estimates in Table 1 and the left-hand plots in Figure 4, the data from Pin 3 appear different from the data from Pins 1, 2, and 4. After linearizing the data within each pin, however, we have accounted for this difference (each color denotes a particular quadrant in the microarray chip); for comparison, we also provide the M versus A plot in Figure 4(b) (Yang et al., 2001).

Comparing the three linearization results, there appears to be little difference between the estimates for $a$, $b$, and $c$. Thus, to limit the use of degrees of freedom, we maintain our initial

Fig. 4. Microarray example before and after transformation process: (a) log intensity versus log intensity plot, (b) mean log intensity (A) versus log intensity ratio (M) plot. Each panel pair shows original values (left) and normalized values (right), with Pins 1, 2, 3, and 4 distinguished by color. See results from Equation (4) in Table 1 for $\hat{a}$, $\hat{b}$, $\hat{c}$ values.

transformation procedure over the entire chip. We now define $L_1 = \log(P_1 - \hat{a}) + \hat{c}$ and $L_2 = \log(P_2 - \hat{b})$.

Table 1. Linearization results for $\hat{a}$, $\hat{b}$, $\hat{c}$ using Equation (2) versus Equations (3) and (4), respectively.
Columns: Equation, $\hat{a}$, $\hat{b}$, $\hat{c}$. Rows: (2), (3), (4).

3.2 Initial models

Our second step in analyzing the relationship between spots on the microarray chip is to consider ANOVA models corresponding to each of the three possible sources of variation described in Section 2.2. We consider models relating the differential log expression to each of the following factor combinations: chip row and chip column locations; plate number, and plate row and column locations; and pin number and time order, respectively. The effects of all of these factors in each of the three initial models were shown to be significant; however, we think that the effect of time order was small (F 14, with 3 and 2235 degrees of freedom) compared to its very large number of degrees of freedom. Further, we do not want to risk overfitting our data by including such a large number of degrees of freedom; thus the time order

factor is no longer considered in the remainder of the paper. Of interest now is the effect of each of the remaining factors on the data as a complete model. In order to determine the relative effect, we must proceed with caution. This is due to the collinearity that exists between the factors; more detail is given in supplementary material at www.biostatistics.oupjournals.org.

3.3 Building a complete model

Due to the interdependencies between the factors with regard to their location, we cannot blindly build a model that includes all of the factors. Therefore, we build a forward stepwise ANOVA model, taking into account the relative standing of each factor in the overall model and its number of degrees of freedom. In particular, we first consider the effect of the plate row location on the linearized differential log expression values. Then, we use the residuals from the plate row model to measure the effect due to plate column. The process continues as we consider the effect due to chip row, chip column, and plate number, respectively. From these intermediate results, we can build a more representative ANOVA table for all of the factors within one model, determining the degrees of freedom for the residuals by taking the corrected total degrees of freedom minus the sum of the degrees of freedom for the factors in the model. For this example, therefore, we obtain the ANOVA table displayed in Table 2.

Although this again demonstrates the significance of each of the factors in their effect on differential expression levels, this table is still not exact, because the degrees of freedom under such a formulation are not the true degrees of freedom for this model (this is due to the interdependencies of the factors). However, given the procedure by which the model was established, the degrees of freedom listed in Table 2 represent upper bounds on the true degrees of freedom. As a result, the F-statistics will only increase, thus demonstrating an even greater significance of

the factor effects.

Table 2. Approximate ANOVA table representing effect of all factors on log-differential expression. These terms are added sequentially (first to last).
Columns: Df, Sum of Sq, Mean Sq, F Value, Pr(F). Rows: Pinwise Linearization, Plate Row, Plate Column, Chip Row, Chip Column, Plate Number, Residuals, Grand Total.

We can gain further insight from the information provided in Table 2; e.g., we see that the dominating factor in the above model is the linearization. Meanwhile, the plate row, plate column, chip row, chip column, and plate number factor mean squares are all much smaller, yet still demonstrate great significance in the model.

Finally, we check for any relationship among the estimated coefficients within the respective factors. To aid us in this inquiry, we smooth the values to help identify any patterns that may exist among the coefficients. There appears to be a spatial pattern within chip column; a negative trend appears over the last 50 column values (this corresponds to Pins 2 and 4). There are also trends in the plate row and plate column coefficients; however, given the small number of coefficients for both of these factors, their significance is questionable. The smoothing suggests that a slight trend exists among the chip row coefficients. Upon inspecting the coefficients directly,

however, this trend appears to be additional influence existing within Pins 3 and 4; there is a negative trend over the last 51 chip rows. Finally, there does not appear to be any spatial trend among the coefficients for plate number.

Fig. 5. Plot of coefficient values for each of the factor models referenced in Table 2. From (top) left to right: Plate Row, Plate Column, Chip Row, Chip Column, Plate Number.

To assess the validity of this model in a way that is independent of the data, we randomly sample 10 percent of the data and remove it from consideration (hereafter referred to as the test data), and build a model on the remaining 90 percent of the data (called the training data). The resulting linearization and analysis of variance results are approximately equal to those shown above with all of the data used in the analysis. Further, when using the coefficients determined from the model using the training data, the resulting residual mean-squared error for the test data is 0.0307, approximately 10 percent greater than that for the training data. Thus, we see that the model's effectiveness is not data-dependent.
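The sequential fitting of Section 3.3, in which each factor is fitted to the residuals left by the previous one, can be sketched as below. This is an illustrative reading rather than the authors' code: each factor is treated as a one-way classification whose level means are subtracted in turn, and the names `remove_factor` and `sequential_fit` are hypothetical.

```python
from collections import defaultdict

def remove_factor(values, levels):
    """Subtract each level's mean; return (residuals, level effects)."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, g in zip(values, levels):
        sums[g] += v
        counts[g] += 1
    effects = {g: sums[g] / counts[g] for g in sums}
    residuals = [v - effects[g] for v, g in zip(values, levels)]
    return residuals, effects

def sequential_fit(values, factors):
    """Fit factors one at a time, each on the residuals of the previous fit.

    factors: list of (name, levels) pairs in the order they enter the model,
    e.g. plate row, plate column, chip row, chip column, plate number.
    Returns the final residuals and rows of (name, df, sum of squares).
    """
    grand_mean = sum(values) / len(values)
    residuals = [v - grand_mean for v in values]  # corrected total
    table = []
    for name, levels in factors:
        ss_before = sum(r * r for r in residuals)
        residuals, effects = remove_factor(residuals, levels)
        ss_after = sum(r * r for r in residuals)
        # Levels minus one is the nominal (upper-bound) degrees of freedom.
        table.append((name, len(effects) - 1, ss_before - ss_after))
    return residuals, table
```

Because the factors are not orthogonal, the degrees of freedom recorded this way are nominal values, in line with the remark above that the tabulated degrees of freedom are upper bounds.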

Table 3. Approximate ANOVA table representing nested effect of chip row and chip column within pin, and all other factors on log-differential expression. These terms are added sequentially (first to last).
Columns: Df, Sum of Sq, Mean Sq, F Value, Pr(F). Rows: Pinwise Linearization, Chip Row in Pin, Chip Column in Pin, Plate Row, Plate Column, Plate Number, Residuals, Grand Total.

3.4 Building a nested model

One can argue that the pin or print-tip effect is not as obvious as the chip row and chip column effect, given that the row and column effects are just as striking within each quadrant (corresponding to the associated pin) as they are over the entire chip. Thus, we consider a model where we nest the factors of chip row and chip column within the pin factor. We find that all of the factors are again highly significant. In particular, we notice an increase in the sum of squares value for the nested chip row and chip column factors within pin. However, their increased degrees of freedom result in a slight decrease in mean square for chip row and chip column after considering these factors nested within pin. Nonetheless, the overall effect of nesting does not change the relative significance of any of these factors, with only a slight decrease in the mean-squared error in Table 3 compared to Table 2. Thus, we maintain consideration of

the full model shown in Table 2.

3.5 Model comparison

We consider how our linearization and analysis results for the general model compare to those achieved when using the lowess linearization proposed by Yang et al. (2001). To estimate the degrees of freedom corresponding to the lowess linearization (f = 0.4), we instead use loess(span=0.4, degree=1, family="symmetric") in S-plus and obtain the degrees of freedom estimate via the summary() feature in S-plus. The algorithm used to determine the loess degrees of freedom is provided in Cleveland and Devlin (1988). The results using the loess linearization are contained in Table 4. Because of the large sample size in the data, the fitted values obtained using loess do not agree exactly with the lowess fitted values. The difference in values, however, does not appear to be significant.

We note that the mean squared error value for our linearization differs from the loess results by approximately 20%, and the apparent number of degrees of freedom needed for the loess procedure is not greatly different from that for our proposed model; this is a strong argument in favor of the loess model. Personally, we prefer the simpler linearization because we can easily explain it. We recognize, however, that the loess model provides a better fit without many additional degrees of freedom.

4 Gene Selection

In Section 3, we demonstrated that systematic variation exists within the microarray chip related to the linearization process, as well as the pin, plate row location, plate column location, chip row location, chip column location, and plate number factors. In this section, we examine how these affect the detection of significant genes.
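For readers who want to experiment with the simpler linearization of Section 3.1 that is compared against loess above, the criterion of Equation (2) and the Huber function of Equation (3) can be sketched as follows. The paper solves the minimization numerically without naming a routine; the crude grid search below is a stand-in chosen for illustration only, and all function names are hypothetical. It uses the fact that, for fixed $a$ and $b$, the least-squares $c$ is minus the mean log-ratio.

```python
import math

def best_c(p1, p2, a, b):
    """For fixed a, b the least-squares c is minus the mean log-ratio."""
    logs = [math.log((x - a) / (y - b)) for x, y in zip(p1, p2)]
    return -sum(logs) / len(logs)

def squared_loss(p1, p2, a, b, c):
    """Sum of squared residuals r_i, as in Equation (2)."""
    return sum((math.log((x - a) / (y - b)) + c) ** 2 for x, y in zip(p1, p2))

def huber_rho(u, d=1.345):
    """Huber's rho as written in Equation (3)."""
    return u * u if abs(u) <= d else d * (2 * abs(u) - d)

def huber_loss(p1, p2, a, b, c, d=1.345):
    """Sum of rho(r_i), the robust alternative to squared_loss."""
    return sum(huber_rho(math.log((x - a) / (y - b)) + c, d)
               for x, y in zip(p1, p2))

def grid_search(p1, p2, a_grid, b_grid):
    """Crude search over candidate (a, b) pairs, profiling out c.

    A stand-in for the numerical optimizer, which the paper does not name.
    Candidates violating 0 <= a < min(P1) or 0 <= b < min(P2) are skipped.
    """
    best = None
    for a in a_grid:
        if not 0 <= a < min(p1):
            continue
        for b in b_grid:
            if not 0 <= b < min(p2):
                continue
            c = best_c(p1, p2, a, b)
            loss = squared_loss(p1, p2, a, b, c)
            if best is None or loss < best[3]:
                best = (a, b, c, loss)
    return best  # (a, b, c, loss), or None if no candidate is feasible
```

Applying the same search to the spots of one pin at a time gives the quadrant-wise variant of Equation (4).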

Table 4. ANOVA result (general model) considering lowess linearization proposed by Yang et al. (2001). These terms are added sequentially (first to last).
Columns: Df, Sum of Sq, Mean Sq, F Value, Pr(F). Rows: Pinwise Loess, Pin, Plate Row, Plate Column, Chip Row, Chip Column, Plate Number, Residuals, Grand Total.

In this work, we demonstrate the effect on gene selection of the systematic variation that was initially present. For each gene location in our sample, we plot its normalized value representing the observed difference of the logs versus its corresponding normalized corrected value (that is, the observed difference of the logs corrected for the systematic effects; we refer to these as the after-normalization values) from the complete model. Further, we first consider those genes whose normalized values fall outside of the [-3, 3] interval either before any modelling was considered or after the values were updated. This cutoff is in accordance with the work of DeRisi et al. (1996), and was chosen simply to indicate the effect of the systematic variation. Other thresholds reveal different numbers of significant genes, but the patterns are similar.

The before-versus-after-normalization plane is divided into nine areas. Points that land

in the top right and bottom left subspaces represent genes that are considered significant both before and after the correction. The top and bottom center regions represent genes that are not significant before applying the model yet are significant after correction, and the center left and right subspaces contain those points that are significant before correction yet not significant after correction. Finally, the upper left and lower right subspaces represent the most significant changes possible under this construct: the genes located here are significant both before and after correction; however, the relationship between the Cy3 and Cy5 intensities has reversed. Figure 6 provides a graphic representation of this notion.

Fig. 6. Possible outcomes for significance before versus after normalization (horizontal axis: before normalization; vertical axis: after normalization). The notation (A, B) represents the significance outcome, where A denotes the significance or non-significance determined before correction, while B denotes the significance or non-significance detected after correction. NS and S denote non-significance and significance, respectively.
Top row, left to right: (S,S) with Cy3-Cy5 reclassified; (NS,S); (S,S).
Middle row, left to right: (S,NS); (NS,NS); (S,NS).
Bottom row, left to right: (S,S); (NS,S); (S,S) with Cy3-Cy5 reclassified.

The ±3 threshold choice is arbitrary; Figure 7 shows how the number of significantly expressed genes changes with the choice of cutoff value before and after normalization, respectively. Figure 7a illustrates how the number of significantly expressed genes before normalization is influenced by the change in cutoff value (after-normalization threshold remains fixed at ±3), while Figure 7b

shows the effect of changing the after-normalization threshold value (before-normalization cutoff fixed at ±3). We see that the number of genes that remain statistically insignificant increases gradually as the range of consideration grows from ±2 to ±4 (see Figure 7a). The same is true with regard to changing the range of inclusion in the after-normalization axis (see Figure 7b). Further, note the location of spots relative to the thresholds. Those regions with associated step function representations in Figure 7 show that spots are far apart from each other and small in number within the subspace under consideration. Thus, the change in cutoff has little effect on the number of genes in these regions. Finally, we see from Figure 7 that, for a ±3 threshold, the number of significantly expressed genes before versus after normalization fluctuates minimally around this threshold region.

Using ±3 as the threshold value, the before-versus-after-normalization plot for these data is provided in Figure 8. There are no genes in which the intensity ratio changes its significance with respect to Cy3 and Cy5 (i.e., no points are contained in the upper-left or bottom-right regions). There are 18 genes that are significantly expressed both before and after the correction, while 39 genes that would not initially be considered significantly expressed are now significant after adjusting the output using the model described in Section 3.3, and 31 genes are no longer significantly expressed after the model is considered; see Table 5. However, there does not appear to be any interesting relationship with respect to relative chip location or associated gene name.

We tested the effect of our normalization on the control data and found that the control data fell beyond the thresholds both before and after normalization. Nonetheless, we do not mean to imply that those genes exceeding this threshold are differentially expressed.
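The nine-region classification behind Figure 6 and Table 5 can be expressed compactly. The sketch below is an illustration only: the function names are hypothetical, and values exactly equal to the cutoff are treated as not significant, following the "outside of the [-3, 3] interval" wording.

```python
from collections import Counter

def classify(before, after, cutoff=3.0):
    """Label one gene by its normalized value before and after correction."""
    def side(v):
        # +1 / -1 for significance in either direction, 0 otherwise.
        if v > cutoff:
            return 1
        if v < -cutoff:
            return -1
        return 0

    sb, sa = side(before), side(after)
    if sb == 0 and sa == 0:
        return "(NS,NS)"
    if sb == 0:
        return "(NS,S)"
    if sa == 0:
        return "(S,NS)"
    # Significant both times: check whether the Cy3/Cy5 direction flipped.
    return "(S,S)" if sb == sa else "(S,S) reclassified"

def contingency(before_values, after_values, cutoff=3.0):
    """Count genes per region, in the spirit of Table 5."""
    return Counter(classify(b, a, cutoff)
                   for b, a in zip(before_values, after_values))
```

Calling `contingency` with a range of cutoffs for one axis while holding the other fixed reproduces the kind of sensitivity check described for the changing cutoff values.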

Fig. 7. Plots demonstrating the effect of a changing cutoff value. (a) Before-normalization cutoff changes, while after-normalization cutoff remains fixed. (b) After-normalization cutoff changes, while before-normalization cutoff remains fixed.

Fig. 8. Before normalization versus after normalization plot. Axes: observed values (normalized) and residual values (normalized).

Table 5. Before versus after normalization contingency table.

5 Discussion

The introduction of the cDNA microarray by Schena et al. (1995) and the analytical procedure of Shalon et al. (1996) are significant contributions made to the field of genetics. There has been substantial statistical research recognizing the need to normalize the data in order to remove systematic variation due to the pins, with discussion noting the need for further consideration of additional factors causing systematic variation (Smyth and Speed, 2003). Through this exploratory work, we have done just that: demonstrated the existence of other sources of significant systematic variation that go unaccounted for in standard analyses. Specifically, there exists systematic variation due to the location on the chip, the location on the plate, and among the batch of plates. It is necessary to remove this extra variation not only due to its effect on the size of the differential gene expression but, more significantly, because of the varied outcomes with respect to gene selection.

It is necessary to have a standard approach to remove the systematic variation within the chip. First, we perform the linearization technique described in Section 3.1. This removes a great deal of the variation in the chip and balances the Cy3 and Cy5 intensities within the chip. Next, we suggest removing the variation in a step-by-step fashion as discussed in Section 3.3. Summarizing, we first remove the variation due to chip row location, and then with respect to chip column. Next, we successively remove the effect due to the plate row, plate column, and plate number.

We compare our model to the lowess (f = 40%) approach of Yang et al. (2001). Both procedures account for a significant amount of variation, with the lowess model reducing the variation more than our model; however, we feel that our approach is easily explained and understood given the context of the experiment and data collection procedure.

Acknowledgements

The authors wish to thank Eleanor Fiengold for her helpful and insightful discussion on this paper and related work regarding microarray analysis, as well as the referees for their astute and insightful comments. We also thank Karoly Mirnics for providing the data. Kimberly Sellers and Jeffrey Miecznikowski were supported in part by the NSF-VIGRE program, grant number DMS

References

Brown, P.O., Botstein, D. (1999). Exploring the new world of the genome with DNA microarrays. Nature Genetics Supplement.
Buja, A., Hastie, T., Tibshirani, R. (1989). Linear Smoothers and Additive Models. Annals of Statistics.
Cheung, V., Morley, M., Aguilar, F., Massimi, A., Kucherlapati, R., Childs, G. (1999). Making and reading microarrays. Nature Genetics Supplement.
Cleveland, W.S. (1979). Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association.
Cleveland, W.S. (1981). LOWESS: A program for smoothing scatterplots by robust locally weighted regression. The American Statistician.
Cleveland, W.S., Devlin, S.J. (1988). Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting. Journal of the American Statistical Association.
Colantuoni, C., Henry, G., Zeger, S., Pevsner, J. (2002). SNOMAD (Standardization and Normalization of Microarray Data): web-accessible gene expression data analysis. Bioinformatics.
DeRisi, J., Penland, L., Brown, P.O., Bittner, M.L. (1996). Use of a cDNA microarray to analyse gene expression patterns in human cancer. Nature Genetics.

Duggan, D.J., Bittner, M., Chen, Y., Meltzer, P., Trent, J.M. (1999). Expression profiling using cDNA microarrays. Nature Genetics Supplement.
Durbin, B.P., Hardin, J.S., Hawkins, D.M., Rocke, D.M. (2002). A variance-stabilizing transformation for gene-expression microarray data. Bioinformatics 18, S105-S110.
Dudoit, S., Yang, Y.H., Callow, M.J., Speed, T.P. (2000). Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical report 578, University of California at Berkeley, 1-38.
Efron, B., Tibshirani, R., Storey, J.D., Tusher, V. (2001). Empirical Bayes Analysis of a Microarray Experiment. Journal of the American Statistical Association.
Finkelstein, D.B., Ewing, R., Gollub, J., Sterky, F., Somerville, S., Cherry, J.M. (2001). Iterative linear regression by sector. In: Methods of Microarray Data Analysis: Papers from CAMDA 2000, eds. S.M. Lin and K.F. Johnson, Kluwer Academic.
Hamadeh, H., Afshari, C.A. (2000). Gene Chips and Functional Genomics. American Scientist.
Hamilton, L. (1992). Regression with Graphics: A Second Course in Applied Statistics. California: Duxbury Press.
Huber, P.J. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics.
Huber, W., von Heydebreck, A., Sültmann, H., Poustka, A., Vingron, M. (2002). Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18, S96-S104.
Kerr, M.K., Afshari, C.A., Bennett, L., Bushel, P., Martinez, J., Walker, J., Churchill, G.A. (2001). Statistical Analysis of a Gene Expression Microarray Experiment with Replication. Statistica Sinica, to appear.

Kerr, M.K., Martin, M., Churchill, G.A. (2000). Analysis of Variance for Gene Expression Microarray Data. Journal of Computational Biology.
Lockhart, D.J., Winzeler, E.A. (2000). Genomics, gene expression and DNA arrays. Nature.
Mallows, C.L. (1980). Some Theory of Nonlinear Smoothers. Annals of Statistics.
Rocke, D.M., Durbin, B. (2001). A model for measurement error for gene expression arrays. Journal of Computational Biology.
Schena, M., Shalon, D., Davis, R.W., Brown, P.O. (1995). Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray. Science.
Shalon, D., Smith, S.J., Brown, P.O. (1996). A DNA Microarray System for Analyzing Complex DNA Samples Using Two-color Fluorescent Probe Hybridization. Genome Research.
Smyth, G.K., Speed, T. (2003). Normalization of cDNA microarray data. METHODS: Selecting Candidate Genes from DNA Array Screens: Application to Neuroscience, D. Carter (ed.).
Yang, Y.H., Dudoit, S., Luu, P., Speed, T.P. (2001). Normalization for cDNA Microarray Data. Microarrays: Optical Technologies and Informatics, M.L. Bittner, Y. Chen, A.N. Dorsel, and E.R. Dougherty (eds.), Proceedings of SPIE.
Yang, Y.H., Buckley, M., Dudoit, S., Speed, T.P. (2002a). Comparison of Methods for Image Analysis on cDNA Microarray Data. Journal of Computational and Graphical Statistics.
Yang, Y.H., Dudoit, S., Luu, P., Lin, D.M., Peng, V., Ngai, J., Speed, T.P. (2002b). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research 30(4): e15.
Yue, H., Eastman, P.S., Wang, B., Minor, J., Doctolero, M.H., Nuttall, R.L., Stack, R., Becker, J.W., Montgomery, J.R., Vainer, M., Johnston, R. (2001). An evaluation of the performance of cDNA microarrays for detecting changes in global mRNA expression. Nucleic Acids Research 29, e41.

A List of Tables

Table 1: Linearization results for $\hat{a}$, $\hat{b}$, $\hat{c}$ using Equation (2) versus Equations (3) and (4), respectively.
Table 2: Approximate ANOVA table representing effect of all factors on log-differential expression. These terms are added sequentially (first to last).
Table 3: Approximate ANOVA table representing nested effect of chip row and chip column within pin, and all other factors on log-differential expression. These terms are added sequentially (first to last).
Table 4: ANOVA result (general model) considering lowess linearization proposed by Yang et al. (2001). These terms are added sequentially (first to last).
Table 5: Before versus after normalization contingency table.

B List of Figures

Figure 1: Chip representation: black denotes location of (a) unused spaces, (b) unused spaces and spots from control genes, (c) all spots removed from analysis (unused, control, and missing).
Figure 2: Boxplot comparison of control versus experimental data. The data values shown represent the difference of the logged intensities, $\log(p_1) - \log(p_2)$, in Equation (1).
Figure 3: $\log(p_1) - \log(p_2)$: (a) true chip representation, (b) ranked chip representation.
Figure 4: Microarray example before and after transformation process: (a) log intensity versus log intensity plot, (b) mean log intensity (A) versus log intensity ratio (M) plot. See Equation (4) results in Table 1 for $\hat{a}$, $\hat{b}$, $\hat{c}$ values.

Figure 5: Plot of coefficient values for each of the factor models referenced in Table 2. From (top) left to right: Plate Row, Plate Column, Chip Row, Chip Column, Plate Number.
Figure 6: Possible outcomes for significance before versus after correction. The notation (A, B) represents the significance outcome, where A denotes the significance or non-significance determined before correction, while B denotes the significance or non-significance detected after correction. NS and S denote non-significance and significance, respectively.
Figure 7: Plots demonstrating the effect of a changing cutoff value. (a) Before normalization cutoff changes, while after normalization cutoff remains fixed. (b) After normalization cutoff changes, while before normalization cutoff remains fixed.
Figure 8: Before normalization versus after normalization plot.

C Supplementary Material

The following material is intended for inclusion on the website, www.biostatistics.oupjournals.org, as supplementary material to the above paper.

D Relationships to spot location

We consider the data in relation to the following factors: pin, plate row, plate column, chip row, chip column, and plate number. It is worth noting, however, that all of these factors can be expressed by their location on the microarray chip, $i = 1, \ldots, 10200$, where the 24 unused locations occur for $i = 2545, \ldots, 2550$; $5095, \ldots, 5100$; $7645, \ldots, 7650$; and $10195, \ldots, 10200$. Because the associated spots are unused and, therefore, not considered in the analyses, we do not concern ourselves with these locations in the formulae below. To ease the computations, we

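define $i' := i \bmod 2550$; thus the following relationships will be expressed in terms of $i$ and $i'$. The pin or quadrant value, $q_i$, is easily related to the location of the spot:

\[ q_i = \left\lceil \frac{i}{2550} \right\rceil . \]

We can explain the relationship of the ordered spots to their location on a plate with respect to plate row (denoted $j$) and plate column (denoted $k$). The plate row location is described as

\[ j_i = \begin{cases} 2 \left\lceil \dfrac{[i \bmod 2550] \bmod 24}{6} \right\rceil - 1 = 2 \left\lceil \dfrac{i' \bmod 24}{6} \right\rceil - 1, & \text{for } i' \not\equiv 0 \pmod{24}, \\ 7, & \text{for } i' \equiv 0 \pmod{24}, \end{cases} \]

if $1 \le i \le 5100$, and

\[ j_i = \begin{cases} 2 \left\lceil \dfrac{[i \bmod 2550] \bmod 24}{6} \right\rceil = 2 \left\lceil \dfrac{i' \bmod 24}{6} \right\rceil, & \text{for } i' \not\equiv 0 \pmod{24}, \\ 8, & \text{for } i' \equiv 0 \pmod{24}, \end{cases} \]

if $5101 \le i \le 10200$. For plate column,

\[ k_i = \begin{cases} 2\{[i \bmod 2550] \bmod 6 - 1\} + 1 = 2[(i' \bmod 6) - 1] + 1, & \text{for } i' \bmod 6 \ne 0, \\ 11, & \text{for } i' \bmod 6 = 0, \end{cases} \]

if $1 \le i \le 2550$ or $5101 \le i \le 7650$, and

\[ k_i = \begin{cases} 2\{[i \bmod 2550] \bmod 6\} = 2[i' \bmod 6], & \text{for } i' \bmod 6 \ne 0, \\ 12, & \text{for } i' \bmod 6 = 0, \end{cases} \]

if $2551 \le i \le 5100$ or $7650 < i \le 10200$. The chip row and chip column locations (denoted $r_i$ and $c_i$, respectively) are easily described in terms of the array location. The function,

\[ r_i = \begin{cases} \left\lceil \dfrac{i \bmod 2550}{50} \right\rceil = \left\lceil \dfrac{i'}{50} \right\rceil, & \text{for } i = 1, \ldots, 5100, \\ 51 + \left\lceil \dfrac{i \bmod 2550}{50} \right\rceil = 51 + \left\lceil \dfrac{i'}{50} \right\rceil, & \text{for } i = 5101, \ldots, 10200, \end{cases} \]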

describes the relationship between $i$, $i'$ and chip row, while

\[ c_i = \begin{cases} i \bmod 100, & \text{for } i \not\equiv 0 \pmod{100}, \\ 100, & \text{for } i \equiv 0 \pmod{100}, \end{cases} \]

relates $i$ to chip column. Let $n_i$ denote the plate number on which the spot is located ($n_i = 1, \ldots, 106$). Thus, the plate number is also a function of the spot location in the array representation, in that

\[ n_i = \left\lceil \frac{i \bmod 2550}{24} \right\rceil = \left\lceil \frac{i'}{24} \right\rceil . \]

E Results for Microarray Chip 2

We first remove from analysis the data corresponding to the location of unused spaces, controls, and missing data. In particular, the controls are removed because the spread of the control data is significantly larger than that of the experimental data, with a substantial number of outliers (see Figure 9); hence they do not provide a reasonable measure of comparison for our analysis. From viewing the chip representations of the actual differential log expression (Figure 10a), and particularly its associated ranked version (Figure 10b), there does not appear to be any significant trend that is detectable by eye. Nonetheless, we proceed with further analysis to check this impression.

First, by plotting the $\log(p_1)$ versus $\log(p_2)$ and A versus M scatterplots, we observe an interesting form of clustering: the majority of the $\log(p_1)$ values are between 4 and 5.5, while the majority of $\log(p_2)$ values cluster between 4.5 and 6; meanwhile, the remaining values are more scattered in a somewhat curvilinear trend; see Figures 11a and b. By performing the linearization technique described in Equation (4), this curvature is slightly diminished and the trend is thus generally linear with a slope of 1 in Figure 11a (equivalently, the trend has a slope of 0 in Figure 11b). Note that the estimates for $\hat{a}$, $\hat{b}$, $\hat{c}$ cluster between Pins 1 and 3, and Pins 2

Fig. 9. Boxplot comparison of control versus experimental data for Chip 2. The data values shown represent the difference of the logged intensities, $\log(p_1) - \log(p_2)$.

Fig. 10. $\log(p_1) - \log(p_2)$ for Chip 2: (a) true chip representation, (b) ranked chip representation.

Table 6. Chip 2 linearization results for $\hat{a}$, $\hat{b}$, $\hat{c}$ using Equation (2) versus Equations (3) and (4), respectively. Columns: Equation, $\hat{a}$, $\hat{b}$, $\hat{c}$; rows: (2), (3), (4).

and 4; see Table 6. There appears to be a significant change due to the proposed linearization process for Chip 2. When the complete model is considered, only the linearization and chip row factors are significant, while the remaining factors have small mean sums of squares; see Table 7 for the approximate analysis of variance table. Thus, we see that the systematic variation within this chip is due largely to the linearization and chip row effects. As before, the nested model in Table 8 does not show any clear advantage over the general model; we therefore retain the proposed general model shown in Table 7. In comparison with the lowess model of Table 9, we find that the lowess model appears to perform better, yielding an MSE approximately 20% smaller than that for our proposed model. Using training and test data, we find that the model's results are not data-dependent. The resulting linearization and analysis of variance results are approximately equal to those shown above with all of the data used in the analysis, and the mean-square error for the test data is

Fig. 11. Microarray example before and after transformation process, shown for Pins 1 to 4: (a) log intensity versus log intensity plot (original and normalized values of $\log(p_1)$ and $\log(p_2)$), (b) mean log intensity (A) versus log intensity ratio (M) plot (original and normalized values of A and M). See results from Equation (4) in Table 1 for $\hat{a}$, $\hat{b}$, $\hat{c}$ values.

Table 7. Approximate ANOVA table representing effect of all factors on log-differential expression for Chip 2. These terms are added sequentially (first to last). Columns: Df, Sum of Sq, Mean Sq, F Value, Pr(F); rows: Pinwise Linearization, Plate Row, Plate Column, Chip Row, Chip Column, Plate Number, Residuals, Grand Total.

, approximately 10 percent greater than that for the training data. In searching for any spatial relationships within the various factors in the general model, there do not appear to be any significant trends. Although the first four plate row coefficients have an increasing trend while the last four successively decrease, it is difficult to ascertain a trend given the small number of coefficients. Similarly, the plot of chip column coefficients is interesting in that there is a slight quadratic trend over column location. Further, the majority of coefficients fall within [-0.07, 0.07], with two outlying coefficient values corresponding to the 59th and 73rd columns of the chip. Refer to Figure 12 for a spatial representation of the coefficients corresponding to pin, plate row, plate column, chip row, chip column, and plate number. Finally, in consideration of gene selection, the before versus after normalization plot for Chip 2 is provided in Figure 13 with its associated contingency table, Table 10, providing the number


More information

Microarray Data Analysis - II. FIOCRUZ Bioinformatics Workshop 6 June, 2001

Microarray Data Analysis - II. FIOCRUZ Bioinformatics Workshop 6 June, 2001 Microarray Data Analysis - II FIOCRUZ Bioinformatics Workshop 6 June, 2001 Challenges in Microarray Data Analysis Spot Identification and Quantitation. Normalization of data from each experiment. Identification

More information

Topic 4: Orthogonal Contrasts

Topic 4: Orthogonal Contrasts Topic 4: Orthogonal Contrasts ANOVA is a useful and powerful tool to compare several treatment means. In comparing t treatments, the null hypothesis tested is that the t true means are all equal (H 0 :

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 5, Issue 1 2006 Article 28 A Two-Step Multiple Comparison Procedure for a Large Number of Tests and Multiple Treatments Hongmei Jiang Rebecca

More information

Rank parameters for Bland Altman plots

Rank parameters for Bland Altman plots Rank parameters for Bland Altman plots Roger B. Newson May 2, 8 Introduction Bland Altman plots were introduced by Altman and Bland (983)[] and popularized by Bland and Altman (986)[2]. Given N bivariate

More information

Topics on statistical design and analysis. of cdna microarray experiment

Topics on statistical design and analysis. of cdna microarray experiment Topics on statistical design and analysis of cdna microarray experiment Ximin Zhu A Dissertation Submitted to the University of Glasgow for the degree of Doctor of Philosophy Department of Statistics May

More information

Soil Phosphorus Discussion

Soil Phosphorus Discussion Solution: Soil Phosphorus Discussion Summary This analysis is ambiguous: there are two reasonable approaches which yield different results. Both lead to the conclusion that there is not an independent

More information

Sociology 6Z03 Review II

Sociology 6Z03 Review II Sociology 6Z03 Review II John Fox McMaster University Fall 2016 John Fox (McMaster University) Sociology 6Z03 Review II Fall 2016 1 / 35 Outline: Review II Probability Part I Sampling Distributions Probability

More information

WISE Regression/Correlation Interactive Lab. Introduction to the WISE Correlation/Regression Applet

WISE Regression/Correlation Interactive Lab. Introduction to the WISE Correlation/Regression Applet WISE Regression/Correlation Interactive Lab Introduction to the WISE Correlation/Regression Applet This tutorial focuses on the logic of regression analysis with special attention given to variance components.

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Bradley Broom Department of Bioinformatics and Computational Biology UT M. D. Anderson Cancer Center kabagg@mdanderson.org bmbroom@mdanderson.org

More information

1 A Review of Correlation and Regression

1 A Review of Correlation and Regression 1 A Review of Correlation and Regression SW, Chapter 12 Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then

More information

RCB - Example. STA305 week 10 1

RCB - Example. STA305 week 10 1 RCB - Example An accounting firm wants to select training program for its auditors who conduct statistical sampling as part of their job. Three training methods are under consideration: home study, presentations

More information

Fuzzy Clustering of Gene Expression Data

Fuzzy Clustering of Gene Expression Data Fuzzy Clustering of Gene Data Matthias E. Futschik and Nikola K. Kasabov Department of Information Science, University of Otago P.O. Box 56, Dunedin, New Zealand email: mfutschik@infoscience.otago.ac.nz,

More information

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure.

Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple Regression procedure. STATGRAPHICS Rev. 9/13/213 Calibration Models Summary... 1 Data Input... 3 Analysis Summary... 5 Analysis Options... 7 Plot of Fitted Model... 9 Predicted Values... 1 Confidence Intervals... 11 Observed

More information

Non-specific filtering and control of false positives

Non-specific filtering and control of false positives Non-specific filtering and control of false positives Richard Bourgon 16 June 2009 bourgon@ebi.ac.uk EBI is an outstation of the European Molecular Biology Laboratory Outline Multiple testing I: overview

More information

Correlation & Simple Regression

Correlation & Simple Regression Chapter 11 Correlation & Simple Regression The previous chapter dealt with inference for two categorical variables. In this chapter, we would like to examine the relationship between two quantitative variables.

More information

This document contains 3 sets of practice problems.

This document contains 3 sets of practice problems. P RACTICE PROBLEMS This document contains 3 sets of practice problems. Correlation: 3 problems Regression: 4 problems ANOVA: 8 problems You should print a copy of these practice problems and bring them

More information

Supplementary Information

Supplementary Information Supplementary Information A versatile genome-scale PCR-based pipeline for high-definition DNA FISH Magda Bienko,, Nicola Crosetto,, Leonid Teytelman, Sandy Klemm, Shalev Itzkovitz & Alexander van Oudenaarden,,

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Lecture Outline. Biost 518 Applied Biostatistics II. Choice of Model for Analysis. Choice of Model. Choice of Model. Lecture 10: Multiple Regression:

Lecture Outline. Biost 518 Applied Biostatistics II. Choice of Model for Analysis. Choice of Model. Choice of Model. Lecture 10: Multiple Regression: Biost 518 Applied Biostatistics II Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture utline Choice of Model Alternative Models Effect of data driven selection of

More information

INFERENCE FOR REGRESSION

INFERENCE FOR REGRESSION CHAPTER 3 INFERENCE FOR REGRESSION OVERVIEW In Chapter 5 of the textbook, we first encountered regression. The assumptions that describe the regression model we use in this chapter are the following. We

More information

A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data

A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data A Sparse Solution Approach to Gene Selection for Cancer Diagnosis Using Microarray Data Yoonkyung Lee Department of Statistics The Ohio State University http://www.stat.ohio-state.edu/ yklee May 13, 2005

More information

Examination paper for TMA4255 Applied statistics

Examination paper for TMA4255 Applied statistics Department of Mathematical Sciences Examination paper for TMA4255 Applied statistics Academic contact during examination: Anna Marie Holand Phone: 951 38 038 Examination date: 16 May 2015 Examination time

More information

Prediction of Bike Rental using Model Reuse Strategy

Prediction of Bike Rental using Model Reuse Strategy Prediction of Bike Rental using Model Reuse Strategy Arun Bala Subramaniyan and Rong Pan School of Computing, Informatics, Decision Systems Engineering, Arizona State University, Tempe, USA. {bsarun, rong.pan}@asu.edu

More information

arxiv: v1 [stat.me] 14 Jan 2019

arxiv: v1 [stat.me] 14 Jan 2019 arxiv:1901.04443v1 [stat.me] 14 Jan 2019 An Approach to Statistical Process Control that is New, Nonparametric, Simple, and Powerful W.J. Conover, Texas Tech University, Lubbock, Texas V. G. Tercero-Gómez,Tecnológico

More information

9 Correlation and Regression

9 Correlation and Regression 9 Correlation and Regression SW, Chapter 12. Suppose we select n = 10 persons from the population of college seniors who plan to take the MCAT exam. Each takes the test, is coached, and then retakes the

More information

10 Model Checking and Regression Diagnostics

10 Model Checking and Regression Diagnostics 10 Model Checking and Regression Diagnostics The simple linear regression model is usually written as i = β 0 + β 1 i + ɛ i where the ɛ i s are independent normal random variables with mean 0 and variance

More information

Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA

Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA Technologie w skali genomowej 2/ Algorytmiczne i statystyczne aspekty sekwencjonowania DNA Expression analysis for RNA-seq data Ewa Szczurek Instytut Informatyki Uniwersytet Warszawski 1/35 The problem

More information

BMC Bioinformatics. Open Access. Abstract. Background Microarray technology has become a useful tool for quantitatively.

BMC Bioinformatics. Open Access. Abstract. Background Microarray technology has become a useful tool for quantitatively. BMC Bioinformatics BioMed Central Methodology article A robust two-way semi-linear model for normalization of cdna microarray data Deli Wang, Jian Huang, Hehuang Xie 3, Liliana Manzella 3 and Marcelo Bento

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /8/2016 1/38

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /8/2016 1/38 BIO5312 Biostatistics Lecture 11: Multisample Hypothesis Testing II Dr. Junchao Xia Center of Biophysics and Computational Biology Fall 2016 11/8/2016 1/38 Outline In this lecture, we will continue to

More information

Discovering Correlation in Data. Vinh Nguyen Research Fellow in Data Science Computing and Information Systems DMD 7.

Discovering Correlation in Data. Vinh Nguyen Research Fellow in Data Science Computing and Information Systems DMD 7. Discovering Correlation in Data Vinh Nguyen (vinh.nguyen@unimelb.edu.au) Research Fellow in Data Science Computing and Information Systems DMD 7.14 Discovering Correlation Why is correlation important?

More information

T H E J O U R N A L O F C E L L B I O L O G Y

T H E J O U R N A L O F C E L L B I O L O G Y T H E J O U R N A L O F C E L L B I O L O G Y Supplemental material Breker et al., http://www.jcb.org/cgi/content/full/jcb.201301120/dc1 Figure S1. Single-cell proteomics of stress responses. (a) Using

More information

Ridge Regression. Summary. Sample StatFolio: ridge reg.sgp. STATGRAPHICS Rev. 10/1/2014

Ridge Regression. Summary. Sample StatFolio: ridge reg.sgp. STATGRAPHICS Rev. 10/1/2014 Ridge Regression Summary... 1 Data Input... 4 Analysis Summary... 5 Analysis Options... 6 Ridge Trace... 7 Regression Coefficients... 8 Standardized Regression Coefficients... 9 Observed versus Predicted...

More information

Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics

Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model Checking/Diagnostics Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics The session is a continuation of a version of Section 11.3 of MMD&S. It concerns

More information

Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics

Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model Checking/Diagnostics Regression Analysis V... More Model Building: Including Qualitative Predictors, Model Searching, Model "Checking"/Diagnostics The session is a continuation of a version of Section 11.3 of MMD&S. It concerns

More information

MULTIPLE REGRESSION METHODS

MULTIPLE REGRESSION METHODS DEPARTMENT OF POLITICAL SCIENCE AND INTERNATIONAL RELATIONS Posc/Uapp 816 MULTIPLE REGRESSION METHODS I. AGENDA: A. Residuals B. Transformations 1. A useful procedure for making transformations C. Reading:

More information

Regression Analysis. Regression: Methodology for studying the relationship among two or more variables

Regression Analysis. Regression: Methodology for studying the relationship among two or more variables Regression Analysis Regression: Methodology for studying the relationship among two or more variables Two major aims: Determine an appropriate model for the relationship between the variables Predict the

More information

Remedial Measures, Brown-Forsythe test, F test

Remedial Measures, Brown-Forsythe test, F test Remedial Measures, Brown-Forsythe test, F test Dr. Frank Wood Frank Wood, fwood@stat.columbia.edu Linear Regression Models Lecture 7, Slide 1 Remedial Measures How do we know that the regression function

More information

Chapter 1 Statistical Inference

Chapter 1 Statistical Inference Chapter 1 Statistical Inference causal inference To infer causality, you need a randomized experiment (or a huge observational study and lots of outside information). inference to populations Generalizations

More information

Correlation. Tests of Relationships: Correlation. Correlation. Correlation. Bivariate linear correlation. Correlation 9/8/2018

Correlation. Tests of Relationships: Correlation. Correlation. Correlation. Bivariate linear correlation. Correlation 9/8/2018 Tests of Relationships: Parametric and non parametric approaches Whether samples from two different variables vary together in a linear fashion Parametric: Pearson product moment correlation Non parametric:

More information

Box-Cox Transformations

Box-Cox Transformations Box-Cox Transformations Revised: 10/10/2017 Summary... 1 Data Input... 3 Analysis Summary... 3 Analysis Options... 5 Plot of Fitted Model... 6 MSE Comparison Plot... 8 MSE Comparison Table... 9 Skewness

More information

ANOVA: Analysis of Variation

ANOVA: Analysis of Variation ANOVA: Analysis of Variation The basic ANOVA situation Two variables: 1 Categorical, 1 Quantitative Main Question: Do the (means of) the quantitative variables depend on which group (given by categorical

More information