Fitting Large-Scale Spatial Models with Applications to Microarray Data Analysis

Stephan R. Sain
Department of Mathematics, University of Colorado at Denver, Denver, Colorado
ssain@math.cudenver.edu

Reinhard Furrer
Geophysical Statistics Project, National Center for Atmospheric Research, Boulder, Colorado
furrer@ucar.edu

Many problems in the environmental and biological sciences involve the analysis of large quantities of data. Further, the data in these problems are often subject to various types of structure and, in particular, spatial dependence. Traditional model fitting often fails because of the size of the datasets: it is difficult both to specify and to compute with the full covariance matrix. For example, a single microarray can include over 400,000 individual observations. We propose using a very general type of mixed model that has a random spatial component. Recognizing that spatial covariance matrices often exhibit a large number of zero or near-zero entries, we use covariance tapering to force near-zero entries to zero. Then, taking advantage of the sparse nature of such tapered covariance matrices, we use backfitting to estimate the fixed and random model parameters. Results are demonstrated on an experiment using microarrays to build a profile of differentially expressed genes relating to cerebral vascular malformations, an important cause of hemorrhagic stroke and seizures.

Keywords: Mixed effects; Backfitting; Covariance tapering; Sparse matrices

1 Introduction

Many spatial problems are inherently multivariate, with more than one measurement or observation at each spatial location. Moreover, many spatial problems involve a large number of spatial locations. This leads to serious computational difficulties in constructing, storing, and manipulating very large regression and covariance matrices. Such problems arise in a number of areas, from traditional environmental statistics and epidemiology to new approaches to biological problems. For example, the authors' present research in this area includes combining observed climate data and climate models to examine climate model behavior as well as predictions of climate change. In this setting, there are two variables, precipitation and temperature, for sixteen different models on a 5° grid, resulting in a large number of observations per climate model.

Of particular interest in this paper is a problem of considerable current interest, namely analyzing microarray data for biological experiments. In this case, we are attempting to build a profile of differentially expressed genes relating to cerebral vascular malformation (Shenkar et al., 2003). In this study, there are roughly twenty gene chips with three disease groups (control and two disease states), with each chip essentially a two-dimensional array of approximately 400,000 observations.

We propose a simple, multivariate, additive (mixed-effects) spatial model and discuss some strategies for fitting such models and estimating model parameters when the size of the data structures is large. There are two key aspects. First, recognizing that many if not most of the elements of the spatial covariance matrices are zero or near-zero, covariance tapering is used to force near-zero entries to zero, which introduces a great deal of sparseness in the covariance matrices. This sparseness allows such matrices to be stored and manipulated more efficiently. Second, the additive structure in the model is exploited using a backfitting algorithm for parameter estimation.

The next section develops the model in detail, while Section 3 discusses several computational issues that arise when using huge datasets with backfitting algorithms. Section 4 shows qualitative and quantitative results of a small example using microarray data. Finally, Section 5 discusses current and future research in this long-term project.

2 A Multivariate, Additive Spatial Model

A simple, multivariate, additive (mixed-effects) spatial model for an observation vector $Y$ can be written as

\[ Y = X\beta + h + \varepsilon, \tag{1} \]

where $X\beta$ represents fixed effects; $h$ represents a random, zero-mean spatial process with $\operatorname{Var}(h) = \Sigma_h$; and $\varepsilon$ represents a random, zero-mean error process with $\operatorname{Var}(\varepsilon) = \Sigma_\varepsilon$, orthogonal to $h$. Model (1) is generic; in the case of gene expression on a chip, the fixed effects can be expanded to

\[ Y = \mathbf{1}\beta_{\text{mean}} + R\beta_{\text{row}} + C\beta_{\text{col}} + G\beta_{\text{Gene}} + h + \varepsilon. \tag{2} \]

From a biological point of view, $\beta_{\text{Gene}}$ is the quantity of interest, whereas the chip-specific effects $\beta_{\text{Chip}} = [\beta_{\text{mean}}, \beta_{\text{row}}^T, \beta_{\text{col}}^T]^T$ are ancillary and are included to account for any chip-specific, large-scale trends observed in the data. It is convenient to parameterize the spatial covariance matrix by $\theta$ ($\Sigma_h = K(\theta)$) and to assume a white noise measurement error ($\Sigma_\varepsilon = \sigma^2 I$).

Suppose we have $k$ different chips having an identical gene layout. Then, model (1) is expanded to

\[
\begin{pmatrix} Y_1 \\ \vdots \\ Y_k \end{pmatrix}
=
\begin{pmatrix} F & & 0 & G \\ & \ddots & & \vdots \\ 0 & & F & G \end{pmatrix}
\begin{pmatrix} \beta_{\text{Chip }1} \\ \vdots \\ \beta_{\text{Chip }k} \\ \beta_{\text{Gene}} \end{pmatrix}
+
\begin{pmatrix} h_{\text{Chip }1} \\ \vdots \\ h_{\text{Chip }k} \end{pmatrix}
+
\begin{pmatrix} \varepsilon_{\text{Chip }1} \\ \vdots \\ \varepsilon_{\text{Chip }k} \end{pmatrix}
\tag{3}
\]

with $F = [\mathbf{1}, R, C]$ and where the spatial processes $h_{\text{Chip }i}$ are mutually independent. The covariance matrices of the spatial process and the error process are assumed to take the forms

\[
\Sigma_h = \begin{pmatrix} K_1 & 0 & \cdots & 0 \\ 0 & K_2 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & K_k \end{pmatrix}
\quad\text{and}\quad
\Sigma_\varepsilon = \begin{pmatrix} \sigma_1^2 I & 0 & \cdots & 0 \\ 0 & \sigma_2^2 I & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \sigma_k^2 I \end{pmatrix},
\tag{4}
\]

where $K_i = K(\theta_i)$ represents a chip-specific spatial covariance matrix parameterized by $\theta_i$, and the $\sigma_i^2$ are chip-specific variances, called the nugget effect in the geostatistical literature. The independence assumption across chips is justified by the fact that, typically, the chips are based on unique tissue samples, often from different individuals. The chip-specific fixed ($\beta_{\text{Chip }i}$) and random, spatial ($h_{\text{Chip }i}$) effects are included to account for chip-specific large-scale and small-scale spatial trends. This type of structure is able to model non-linear relationships in much the same fashion as smoothing splines (Nychka, 2000). Note that this model can also be written in an additive fashion, separating chip-specific and gene-specific effects.

For small samples, one could use ML or REML (Kitanidis, 1997; Stein, 1999) to fit the covariance parameters; estimates of $\beta$ and predictions of the random effects then follow directly:

\[
\hat\beta = (X^T V^{-1} X)^{-1} X^T V^{-1} Y \quad \text{(generalized least squares)}, \qquad
\hat h = \Sigma_h V^{-1} (Y - X\hat\beta),
\tag{5}
\]

where $V = \Sigma_h + \Sigma_\varepsilon$. These estimates are equivalent to the universal kriging solutions, i.e., the best linear unbiased predictor (e.g., Cressie, 1993). In our setting, direct computation with the design and covariance matrices is impossible because the observation vector $Y$ is too big, even with only one or two chips. We solve this problem with the backfitting algorithms outlined below.

2.1 Backfitting with One Chip

Backfitting procedures are widely used in additive or generalized linear/additive models, e.g., Breiman and Friedman (1985); Buja et al. (1989). Applied to equation (1), the backfitting algorithm consists of iteratively estimating the fixed effects $\beta$ (regression step) and the spatial process $h$ (kriging step), as schematized below.

[1] Let $\hat h^{(0)}$ be an initial guess and put $j = 1$.
[2] $\hat\beta^{(j)} = (X^T X)^{-1} X^T (Y - \hat h^{(j-1)})$.
[3] Estimate covariance parameters to get $\hat\theta^{(j)}$ and $\hat\sigma^{2(j)}$, then put $\hat h^{(j)} = \Sigma_h V^{-1} (Y - X\hat\beta^{(j)})$.
[4] Put $j = j + 1$ and repeat [2] and [3] until convergence.

To prove equivalence after convergence, plug [3] into [2]: at a fixed point, step [2] gives $X^T (I - \Sigma_h V^{-1})(Y - X\hat\beta) = 0$, and since $I - \Sigma_h V^{-1} = \Sigma_\varepsilon V^{-1} = \sigma^2 V^{-1}$ under the white noise assumption, this is the normal equation of the generalized least-squares estimator (5). The convergence criterion in step [4] should be based on the estimates $\hat\beta^{(j)}$, $\hat\theta^{(j)}$ and $\hat\sigma^{2(j)}$, for example, absolute or relative mean squared differences. The algorithm usually converges in a few steps. We will come back to this issue in Section 4.

In the setting of a single microarray chip, the design matrix $X$ is too big to compute with using the available computing resources. We therefore use equation (2) to separate the different fixed effects and perform the regression step [2] iteratively on the chip-specific effects $\beta_{\text{Chip}}$ and the gene effects $\beta_{\text{Gene}}$. Thus, step [2] becomes:

[2a] Let $\hat\beta_{\text{Gene}}^{(0)}$ be an initial guess and put $l = 1$.
[2b] $\hat\beta_{\text{Chip}}^{(l)} = (F^T F)^{-1} F^T \bigl( Y - \hat h^{(j-1)} - G\hat\beta_{\text{Gene}}^{(l-1)} \bigr)$.
[2c] $\hat\beta_{\text{Gene}}^{(l)} = (G^T G)^{-1} G^T \bigl( Y - \hat h^{(j-1)} - F\hat\beta_{\text{Chip}}^{(l)} \bigr)$.
[2d] Put $l = l + 1$ and repeat [2b] and [2c] until convergence, then $\hat\beta^{(j)} = \bigl( \hat\beta_{\text{Chip}}^{(l-1)\,T}, \hat\beta_{\text{Gene}}^{(l-1)\,T} \bigr)^T$.

2.2 Backfitting with Several Chips

Suppose we have $k$ different chips. According to equation (3), we extend the backfitting algorithm presented in the last section. As there is no spatial structure between different chips (cf. equation (4)), we estimate and fit the spatial structure on each chip separately. In a similar way, the chip-specific effects depend only on the observations of the corresponding chip. Only the gene effects have to be considered across all observations. It can be shown that they can be fitted by taking the mean of the centered observations $Z_i = Y_{\text{Chip }i} - \hat h_{\text{Chip }i}^{(j-1)} - F\hat\beta_{\text{Chip }i}^{(l)}$. Therefore, the design matrices are identical to the case of a single chip. This yields the modified backfitting algorithm below.

[1'] Let $\hat h_{\text{Chip }i}^{(0)}$, $i = 1, \dots, k$, be an initial guess and put $j = 1$.
[2a'] Let $\hat\beta_{\text{Gene}}^{(0)}$ be an initial guess and put $l = 1$.
[2b'] For $i = 1, \dots, k$, $\hat\beta_{\text{Chip }i}^{(l)} = (F^T F)^{-1} F^T \bigl( Y_{\text{Chip }i} - \hat h_{\text{Chip }i}^{(j-1)} - G\hat\beta_{\text{Gene}}^{(l-1)} \bigr)$.
[2c'] Let $Z = \frac{1}{k} \sum_{i=1}^{k} \bigl( Y_{\text{Chip }i} - \hat h_{\text{Chip }i}^{(j-1)} - F\hat\beta_{\text{Chip }i}^{(l)} \bigr)$, then put $\hat\beta_{\text{Gene}}^{(l)} = (G^T G)^{-1} G^T Z$.
[2d'] Put $l = l + 1$ and repeat [2b'] and [2c'] until convergence, then $\hat\beta^{(j)} = \bigl( \hat\beta_{\text{Chip}}^{(l-1)\,T}, \hat\beta_{\text{Gene}}^{(l-1)\,T} \bigr)^T$.
[3'] For $i = 1, \dots, k$, estimate covariance parameters to get $\hat\theta_i^{(j)}$ and $\hat\sigma_i^{2(j)}$, then put $\hat h_{\text{Chip }i}^{(j)} = K_i(\hat\theta_i^{(j)}) \bigl( K_i(\hat\theta_i^{(j)}) + \hat\sigma_i^{2(j)} I \bigr)^{-1} \bigl( Y_{\text{Chip }i} - F\hat\beta_{\text{Chip }i}^{(j)} - G\hat\beta_{\text{Gene}}^{(j)} \bigr)$.
[4'] Put $j = j + 1$ and repeat [2a'] to [3'] until convergence.

This backfitting algorithm uses essentially the same amount of storage as the algorithm for a single chip. Note that the computing time of step [2c'] is comparable with that of step [2c], whereas steps [2b'] and [3'] are $k$ times as expensive as the respective steps in the single chip case.
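As an illustration of steps [1] to [4] for a single chip, the following is a minimal sketch on a toy problem, not the implementation used in this study. It assumes that the covariance parameters are known and held fixed (the estimation part of step [3] is omitted), uses small dense matrices on a one-dimensional lattice instead of sparse ones, and all function and variable names (`solve`, `ols`, `backfit`, `gls`, the toy values of `theta1`, `theta2`, `sigma2`) are hypothetical. It compares the backfitting limit with the direct generalized least-squares estimator (5).

```python
import math

def solve(A, b):
    """Solve the dense system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def matvec(A, v):
    """Dense matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def ols(X, r):
    """Regression step [2]: (X^T X)^{-1} X^T r."""
    p = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    Xtr = [sum(row[a] * ri for row, ri in zip(X, r)) for a in range(p)]
    return solve(XtX, Xtr)

def backfit(X, Y, Sigma_h, V, tol=1e-12, max_iter=1000):
    """Alternate regression and kriging steps with fixed covariance parameters."""
    h = [0.0] * len(Y)                                    # step [1]
    beta = None
    for j in range(1, max_iter + 1):
        new_beta = ols(X, [y - hi for y, hi in zip(Y, h)])            # step [2]
        resid = [y - f for y, f in zip(Y, matvec(X, new_beta))]
        h = matvec(Sigma_h, solve(V, resid))                          # step [3]
        if beta is not None and max(abs(a - b) for a, b in zip(beta, new_beta)) < tol:
            return new_beta, h, j                                     # step [4]
        beta = new_beta
    return beta, h, max_iter

def gls(X, Y, V):
    """Direct estimator (5): (X^T V^{-1} X)^{-1} X^T V^{-1} Y."""
    n, p = len(Y), len(X[0])
    Vinv_Y = solve(V, Y)
    Vinv_X = [solve(V, [row[a] for row in X]) for a in range(p)]  # columns of V^{-1} X
    A = [[sum(X[i][a] * Vinv_X[b][i] for i in range(n)) for b in range(p)] for a in range(p)]
    c = [sum(X[i][a] * Vinv_Y[i] for i in range(n)) for a in range(p)]
    return solve(A, c)

# Toy data: 12 sites on a line, intercept plus linear trend (all values assumed).
sites = list(range(12))
X = [[1.0, float(s)] for s in sites]
Y = [1.0 + 0.3 * s + 0.5 * math.sin(1.7 * s) for s in sites]
theta1, theta2, sigma2 = 0.5, 1.5, 0.5
Sigma_h = [[theta1 * math.exp(-abs(s - t) / theta2) for t in sites] for s in sites]
V = [[Sigma_h[i][j] + (sigma2 if i == j else 0.0) for j in sites] for i in sites]

beta_bf, h_hat, n_iter = backfit(X, Y, Sigma_h, V)
beta_gls = gls(X, Y, V)
print("backfitting:", beta_bf, "after", n_iter, "iterations")
print("GLS        :", beta_gls)
```

With fixed covariance parameters the two estimates should agree up to the convergence tolerance. In this toy setting the number of iterations depends on the ratio of the spatial variance to the nugget, so it need not match the few steps reported for the microarray data.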

3 Computational Issues

One of the aims of this study was to see whether this kind of analysis could be done with existing software on a reasonably sized desktop computer. We decided to use the freely available software R (Ihaka and Gentleman, 1996; R Development Core Team, 2004) on a RedHat Linux system with 2 Gbytes of RAM.

3.1 Sparse Matrices

The design matrices $F$ and $G$ contain as entries $\pm 1$ and a vast number of zeros. If such huge matrices contain only a small percentage of nonzero elements, it is advantageous to use more complex storage schemes than a simple doubly indexed array. One commonly used structure consists of three vectors, where the first contains the nonzero elements, the second the column indexes of the elements stored in the first, and the last pointers to the beginning of each matrix row in the first two vectors. For an $n \times n$ matrix with $z$ nonzero elements we thus need $z$ reals and $z + n + 1$ integers, compared to $n \times n$ reals (e.g., George and Liu, 1981; see also Table 1 as explained in Section 3.3).

The R package SparseM (Koenker and Ng, 2003) contains a few rudimentary functions for handling sparse matrices. We used their concept of representing sparse matrices and wrote the backfitting procedure in a linear, sequential way, calling as few functions as possible in order to save memory. Computationally expensive blocks, such as the construction of the design matrices, are coded in Fortran 77. The coding is similar to the functions given in Furrer (2004).

3.2 Covariance Tapering

In the backfitting algorithm, steps [3] and [3'] are best linear unbiased predictions (BLUP) of a spatial field, also called simple kriging in the geostatistical literature. The BLUP essentially requires solving the huge linear system $Vx = Z$, where $Z$ contains centered observations. Computationally, we first perform a Cholesky factorization $L L^T = V$ and then successively solve the triangular systems $Lw = Z$ and $L^T x = w$, giving $x = V^{-1} Z$.

Typical covariance structures imply full matrices $V$. Tapering the covariance function with some positive definite, compactly supported function induces a sparseness structure in $V$ and preserves asymptotic optimality (Furrer et al., 2004). The taper range determines the degree of approximation but also the sparseness of $V$. Furrer et al. (2004) give a rule of thumb for the number of points that should lie within the taper range. In our setting, we cannot meet this proposal because of memory limitations. With taper distance $\sqrt{2} < \eta \le 2$ (i.e., 8 points) the Cholesky factor of $K_i$ contains a number of nonzero elements of the order $10^6$. However, with $2 < \eta \le 3$ the Cholesky factor of $K_i$ contains a number of nonzero elements of the order $10^8$. We will therefore use a taper length of 2, leading to 3,614,762 (0.002%) and 67,070,820 (0.040%) nonzero elements in the covariance matrix and its Cholesky factor, respectively, for the single chip case.

We suppose that the spatial process is stationary and isotropic, such that the $ij$th element of the covariance matrices is given by a positive definite function $k(h; \theta_1, \theta_2)$, where $h$ is the distance between observations $i$ and $j$. The parameter $\theta_1 = k(0)$ is called the sill and $\theta_2$ is the range parameter, responsible for the rate at which the covariance decays.
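The interplay of tapering and the three-vector storage scheme of Section 3.1 can be sketched on a small lattice. This is an illustrative sketch only: the 6 x 6 lattice, the parameter values 0.5 and 1.5, and the names `tapered_cov`, `to_csr` and `csr_matvec` are assumptions made for the example, and the covariance is an exponential function multiplied by a spherical taper of range 2, so that only the eight nearest neighbors of a site (plus the site itself) give nonzero entries.

```python
import math

def tapered_cov(h, theta1, theta2, eta=2.0):
    """Exponential covariance multiplied by a spherical taper of range eta."""
    if h >= eta:
        return 0.0                       # compact support: zero at and beyond eta
    taper = 1.0 - 1.5 * h / eta + 0.5 * (h / eta) ** 3
    return theta1 * math.exp(-h / theta2) * taper

def to_csr(n, entry):
    """Store an n x n matrix as (values, column indexes, row pointers)."""
    values, col_idx, row_ptr = [], [], [0]
    for i in range(n):
        for j in range(n):
            v = entry(i, j)
            if v != 0.0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))      # start of the next row in values/col_idx
    return values, col_idx, row_ptr

def csr_matvec(values, col_idx, row_ptr, x):
    """Matrix-vector product that touches only the stored nonzero entries."""
    return [sum(values[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
            for i in range(len(row_ptr) - 1)]

m = 6                # lattice side (assumed toy size)
n = m * m            # number of sites

def dist(i, j):
    """Euclidean distance between lattice sites i and j (row-major numbering)."""
    ri, ci = divmod(i, m)
    rj, cj = divmod(j, m)
    return math.hypot(ri - rj, ci - cj)

values, col_idx, row_ptr = to_csr(n, lambda i, j: tapered_cov(dist(i, j), 0.5, 1.5))
nnz = len(values)
row_sums = csr_matvec(values, col_idx, row_ptr, [1.0] * n)
print(nnz, "nonzero entries out of", n * n)
print("storage:", nnz, "reals and", len(col_idx) + len(row_ptr), "integers")
```

The integer count printed at the end corresponds to the $z + n + 1$ integers mentioned in Section 3.1. On a real chip the matrix would never be scanned densely as in `to_csr`; the double loop is used here only to keep the sketch short.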

3.3 Choice of Contrasts

The design matrices are sparse, but the choice of the contrasts determines to what extent. Therefore, this choice is crucial to our objective. As an illustration, Table 1 gives the percentages of nonzero elements for sum and treatment contrasts for one chip.

Table 1: Sparseness for different contrasts. Percentages of nonzero elements compared to a full matrix. The full design matrix $X$ is very large, and for two cases only lower bounds can be given due to limited RAM. Columns: $F$, $F^T F$, $G$, $G^T G$, $X$, $X^T X$. Rows: treatment contrasts, sum contrasts.

As we decouple the chip and gene effects in the regression step of the backfitting algorithm (steps [2b'] and [2c']), we can switch between different contrasts for each of those effects. For interpretability reasons, we choose sum contrasts for $F$ and treatment contrasts for $G$.

Covariance tapering and the additional iteration over the gene and chip effects (steps [2a] to [2d] and [2a'] to [2d']) cut the computational and storage cost considerably. However, 2 Gbytes of RAM are not sufficient for our application to keep all matrices permanently in memory. Hence, for each regression and kriging step we construct the individual design and covariance matrices and eliminate them afterwards.

3.4 Standard Errors

To calculate standard errors of the parameter $\hat\beta$, we could simply use equation (5) to deduce

\[ \operatorname{Var}(\hat\beta) = (X^T V^{-1} X)^{-1} \operatorname{Var}(Y). \]

However, this variance cannot be calculated directly, since the matrices are too big. In the case of one chip, one could simplify the expression to

\[ \operatorname{Var}(\hat\beta_{\text{Gene}}) = \bigl( (G^T G)^{-1} G^T F (F^T F)^{-1} F^T G \bigr) \operatorname{Var}(Y). \]

With our existing computing resources, it is still not possible to evaluate this quantity. We therefore use the simplistic approximation

\[ \operatorname{Var}(\hat\beta_{\text{Gene}}) \approx \bigl( \theta_1^2 + \sigma^2 \bigr) (G^T G)^{-1}. \tag{6} \]

4 Application

The raw data from the microarray chips were rounded to the nearest 1/4. We therefore jittered the data prior to our analysis by adding white noise drawn from a uniform $(-0.125, 0.125)$ distribution. Then we took the logarithm and subtracted the mean. These transformed observations were plugged into the backfitting algorithms previously outlined. The next two sections present the results for a single and a double chip model.
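The preprocessing just described can be sketched as follows. This is a hedged illustration rather than the code used for the analysis: the function name `preprocess`, the synthetic intensities, the seed, and the use of the natural logarithm are assumptions (the text does not state the base of the logarithm).

```python
import math
import random

def preprocess(raw, rng, half_width=0.125):
    """Jitter rounded intensities, take logs, and center at zero."""
    # Undo the rounding to the nearest 1/4 by adding uniform(-1/8, 1/8) noise.
    jittered = [v + rng.uniform(-half_width, half_width) for v in raw]
    # Natural logarithm (assumed); requires strictly positive intensities.
    logged = [math.log(v) for v in jittered]
    mean_log = sum(logged) / len(logged)
    return [v - mean_log for v in logged]

rng = random.Random(2004)                                   # fixed seed, arbitrary
# Synthetic stand-in for chip intensities, rounded to the nearest 1/4.
raw = [round(4 * rng.uniform(50.0, 500.0)) / 4 for _ in range(1000)]
z = preprocess(raw, rng)
print("mean of transformed data:", sum(z) / len(z))
```

The jitter half-width of 0.125 is half the rounding step of 1/4, so each jittered value stays within the interval that would have rounded to the recorded value.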


Figure 1: Two-dimensional empirical covariance function for single chip example.

4.1 Single Chip Results

Figure 1 shows the two-dimensional empirical covariance function after fitting, which strongly supports isotropy of the spatial process. To our knowledge there is no a priori reason for the spatial covariance to have a particular structure. Given the gridded data, empirical covariance estimates are not able to refute a linear behavior of the covariance function at the origin (cf. Figure 2). We assume an exponential covariance structure and use a spherical taper with a fixed taper range of 2 (the maximum value that is computationally feasible). The resulting covariance function for $K_1$ can be written as

\[ k(h; \theta_1, \theta_2) = \theta_1 \exp\Bigl( -\frac{h}{\theta_2} \Bigr) \Bigl( 1 - \frac{3h}{4} + \frac{h^3}{16} \Bigr) \quad \text{for } 0 \le h \le 2, \]

and zero otherwise. Figure 2 shows the ordinary least squares fits with $\theta_1 = 0.487$, $\theta_2 = 1.528$ and nugget effect $\sigma^2 = 0.061$.

Figure 2: Empirical and fitted covariances for single chip example. (Covariance plotted against lag; legend: empirical horizontal, empirical vertical, empirical off axis, fitted exponential, tapered covariance exp*spher taper.)

As presumed, the backfitting algorithm converges quickly, with $\operatorname{MSE}(\hat\beta^{(j)} - \hat\beta^{(j-1)}) < 10^{-4}$ for $j \ge 4$. Figure 3 shows how quickly a few randomly selected coefficients converge. Table 2 gives the required computing time on a Linux-powered 2.6 GHz Xeon processor with 2 Gbytes of RAM.

Figure 3: Convergence of randomly selected fixed effects parameters (row and column effects in the left panel, gene effects in the right panel) for single chip example. (Horizontal axes: iteration.)

Table 2: Computing time for different steps in the backfitting algorithm in R. The values represent the mean of the iterations (Linux, 2.6 GHz Xeon processor with 2 Gbytes of RAM).

Action                                          Time (sec)
Read data, variable setup                             5.59
Create the matrix F^T F                              14.44
Solve F^T F x = b                                     1.61
Create the matrix G^T G                              17.76
Solve G^T G x = b                                     3.08
Total typical regression step (4 iterations)         45.28
Estimate covariance parameters                       12.19
Create the matrix Σ_h                                10.48
Solve Σ_h x = b                                      45.67
Total typical kriging step                           75.51
Total backfitting (4 iterations)                    506.23

Figure 6 displays the row and column effects (the color bar was taken over the range of the displayed data). The top panel reflects the fact that perfect match and mismatch are on alternating rows. The column effects might indicate a small trend with higher values at the right. But the row and column effects are small compared to the spatial process (Figure 7, top). The observations are the sum of the chip-specific effects (Figure 7, bottom), the gene effects and the residuals (Figure 8, top and bottom). The residuals indicate that all the structure could be explained with the fixed effects and a spatial process, as there is no spatial structure or pattern left.

Figure 4 shows the effects between perfect match and mismatch, which are slightly positively skewed. We tried to normalize the effects using the approximation (6). However, little difference is observed and there is a larger number of normalized effects beyond the typical threshold values of $\pm 2$.

Figure 4: Gene effects between perfect match and mismatch for single chip example (the right panel gives the normalized effects). The horizontal and vertical lines are the means. The red and blue curves are smoothed histograms. The dotted curves are superimposed normal densities. (Axes: perfect match against mismatch.)

4.2 Double Chip Results

The algorithm was applied to two chips based on different samples from a single individual. Figure 9 shows the differences between the chip effects and spatial processes in the case of two chips. Although the differences between the chip effects are small compared to the fitted effects, they exhibit an interesting pattern. Chip 2 has almost exclusively bigger effects. The spatial difference shows some rather large blotches, suggesting substantial differences in chip-specific large-scale trends. This emphasizes that large-scale and small-scale chip-specific effects and trends can be modeled and extracted.

Figure 5 compares the fitted gene effects obtained from the analysis with a single chip only and from taking account of both. Reassuringly, there do not seem to be substantial differences in the gene effects across the two chips from the same individual.

Figure 5: Comparison of the fitted gene effects with the single chip and the double chip model. (Axes: effects with two chips against effects with one chip.)

5 Discussion and Outlook

Our goal in this project was essentially a proof of concept, to establish that traditional additive, mixed-effects models for multivariate spatial data could be used to analyze large-scale data problems such as those posed in the environmental and biological sciences. We now begin serious application of these methods, in particular to the microarray data from experiments associated with cerebral vascular malformations (among others). More specifically, we seek a more detailed analysis, including the examination of genes labeled as differentially expressed and the comparison with the results from more established methods.

Moreover, we seek to examine the differences in differentially expressed genes for the different disease groups in our study. This will involve additional modifications to the design matrices. However, we do not perceive this to be a serious complication, and the backfitting algorithms can easily be modified to account for these changes.

Our models currently assume constant means across the probe-level data for each specific gene on the gene chip. There seems to be evidence, both from our own empirical studies and in the biological literature, that this is not the case. We are exploring improved models to account for this additional structure in the data.

Finally, there are additional computational improvements currently being examined. We are exploring computing environments that do not have the memory limitation of 2 Gbytes of RAM. We are also exploring ways of imposing less severe tapering of the spatial covariances in order to approach the more optimal conditions discussed in Furrer et al. (2004). In addition, the fairly regular lattices observed in microarray data lead to a particular sparse structure in the Cholesky factor. Our preliminary experiments suggest that this structure could be exploited to dramatically improve computational performance.

Acknowledgments

The authors would like to thank Professor Isam Awad and Robert Shenkar (Department of Neurological Surgery, Feinberg School of Medicine, Northwestern University) as well as Edith Creek (Department of Mathematics, University of Colorado at Denver) for providing the data and answering our numerous questions. The research of the first author was supported in part by a grant from the University of Colorado Genome-Biotechnology Initiative. The research of both the first and second authors was supported in part by the Geophysical Statistics Project at the National Center for Atmospheric Research under the National Science Foundation grant DMS.

References

Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlations (with discussion). Journal of the American Statistical Association, 80.

Buja, A., Hastie, T. J., and Tibshirani, R. J. (1989). Linear smoothers and additive models (with discussion). Annals of Statistics, 17.

Cressie, N. A. C. (1993). Statistics for Spatial Data. John Wiley & Sons Inc., New York, revised reprint.

Furrer, R. (2004). KriSp: An R package for Covariance Tapered Kriging of Large Datasets Using Sparse Matrix Techniques. Software/KriSp/

Furrer, R., Genton, M. G., and Nychka, D. (2004). Covariance Tapering for Interpolation of Large Spatial Datasets. Submitted to Journal of Computational and Graphical Statistics.

George, A. and Liu, J. W. H. (1981). Computer solution of large sparse positive definite systems. Prentice-Hall Inc., Englewood Cliffs, N.J.

Ihaka, R. and Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5.

Kitanidis, P. K. (1997). Introduction to Geostatistics: Applications in Hydrogeology. Cambridge University Press.

Koenker, R. and Ng, P. (2003). SparseM: Sparse Matrix Package for R.

Nychka, D. W. (2000). Spatial-process estimates as smoothers. In Schimek, M. G., editor, Smoothing and Regression: Approaches, Computation, and Application, chapter 13. John Wiley & Sons Inc., New York.

R Development Core Team (2004). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Shenkar, R., Elliott, J. P., Diener, K., Gault, J., Hu, L., Cohrs, R. J., Phang, T., Hunter, L., Breeze, R. E., and Awad, I. A. (2003). Differential gene expression in human cerebrovascular malformations (with discussion). Neurosurgery, 52.

Stein, M. L. (1999). Interpolation of Spatial Data. Springer-Verlag, New York.

Figure 6: Row (top) and column (bottom) effects for single chip example.

Figure 7: Spatial process (top) and chip specific effects (bottom) for single chip example.

Figure 8: Gene effects (top) and residuals (bottom) for single chip example.

Figure 9: Differences for the chip effects (top) and spatial processes (bottom) in the case of two chips.
