Inference with Transposable Data: Modeling the Effects of Row and Column Correlations

Genevera I. Allen
Department of Pediatrics-Neurology, Baylor College of Medicine, Jan and Dan Duncan Neurological Research Institute, Texas Children's Hospital, & Department of Statistics, Rice University, Houston, TX

Robert Tibshirani
Departments of Health Research & Policy and Statistics, Stanford University, Stanford, CA

Summary. We consider the problem of large-scale inference on the row or column variables of data in the form of a matrix. Many of these data matrices are transposable, meaning that neither the row variables nor the column variables can be considered independent instances. An example of this scenario is detecting significant genes in microarrays when the samples may be dependent due to latent variables or unknown batch effects. By modeling this matrix data using the matrix-variate normal distribution, we study and quantify the effects of row and column correlations on procedures for large-scale inference. We then propose a simple solution to the myriad of problems presented by unanticipated correlations: we simultaneously estimate row and column covariances and use these to sphere or de-correlate the noise in the underlying data before conducting inference. This procedure yields data with approximately independent rows and columns so that test statistics more closely follow null distributions and multiple testing procedures correctly control the desired error rates. Results on simulated models and real microarray data demonstrate major advantages of this approach: (1) increased statistical power, (2) less bias in estimating the false discovery rate, and (3) reduced variance of the false discovery rate estimators.

Keywords: multiple testing, false discovery rate, transposable regularized covariance models, large-scale inference, covariance estimation, matrix-variate normal

1. Introduction

As statisticians, we often make assumptions when constructing a model to ease computations or employ existing methodologies. When conducting inference on matrix data, we often assume that the variables along one dimension (say the columns) are independent, allowing us to pool these observations to make inferences on the variables along the other dimension (rows). In microarrays, for example, it is common to assume that the arrays are independent observations when computing test statistics, allowing us to assess differential expression in genes. Since we are testing many row variables (genes, for example) simultaneously, we correct for multiple testing using procedures that are known to control error measures only when the row variables are independent or follow limited dependence structures. Thus, when conducting inference along the row variables of matrix data, we make the following assumptions to employ existing methodologies: (i) independent column variables and (ii) independent or limited dependencies among row variables. What if these assumptions are incorrect? What if this matrix data is in fact transposable, meaning that potentially both the rows and the columns are correlated?

In this paper, we consider the problem of testing the significance of row variables in a data matrix where there are correlations among the rows, or among the columns, or among both. We study the behavior of standard statistical methodology, such as two-sample t-statistics and controlling for multiplicity via the false discovery rate (FDR), on transposable data. Then, we propose a method to directly account for these two-way dependencies by de-correlating the data matrix before conducting inference. We motivate the presence of these two-way dependencies through an example we will refer to often: testing for differential expression in the two-class microarray. Consider a microarray data set investigating differential gene expression between subjects of Asian descent and European descent, in which there are 4,167 genes and 142 subjects, with arrays processed between 2003 and 2007 (Spielman et al., 2007). In Figure 1, we display the histogram of two-sample t-statistics for this data set superimposed with the density of the theoretical null distribution, the t_(140) distribution. The test statistics are strongly over-dispersed compared to the theoretical null distribution. Many have proposed that this effect could be due to correlations among the genes (Dudoit et al., 2003; Efron, 2004; Qiu et al., 2005; Efron, 2010). Others have noted that correlations among the arrays, perhaps induced by their differing processing dates in this example, can produce the same effect (Qiu et al., 2005; Owen, 2005; Efron, 2009). What if this apparent over-dispersion is caused by both gene and array correlations? And what effect do these correlations have on standard statistical methods used for large-scale inference?

Answers to these questions have been well studied for correlations among the rows or test statistics (Dudoit et al., 2003; Efron, 2004; Owen, 2005; Qiu and Yakovlev, 2006; Efron, 2010; Schwartzman and Lin, 2011). Methods to control the false discovery rate, such as the step-up method of Benjamini and Hochberg (1995) or the permutation-based method of Storey (2002), were originally shown to have theoretical control only when test statistics are independent. Later, however, these conditions were relaxed to show that the FDR is controlled under types of weak dependence (Yekutieli and Benjamini, 1999; Storey et al., 2004; Sarkar, 2008). Benjamini and Yekutieli (2001) developed a step-up procedure that permits FDR control under arbitrary dependencies among tests, but at a great cost in terms of statistical power (Farcomeni, 2008); thus, this method is not preferred in the literature. As the conditions of weak dependence are not easily checked with real data, it is unknown whether false discovery rates are controlled in data with strong correlations among tests, such as with microarrays. In addition, many have recently noted that when tests are correlated, one needs to worry not only about the average false discovery proportion, but also about the variance of the number of false discoveries and of the FDR (Owen, 2005; Qiu and Yakovlev, 2006; Desai et al., 2009; Efron, 2010; Schwartzman and Lin, 2011).

While correlations among test statistics have been studied, especially in the context of microarray data, correlations among the columns and their effect on large-scale inference have received much less scrutiny. First, one might ask whence correlations among columns arise.
For microarrays, these can occur because of batch effects, latent variables, or instrument drift, for example (Yang et al., 2002; Fare et al., 2003; Li and Rabinovic, 2007; Leek and Storey, 2008; Efron, 2009; Leek et al., 2010a). If these are known to the statistician in advance, they can be modeled directly (Li and Rabinovic, 2007; Leek et al., 2010a). Many of these causes of potential array correlations, however, are unknown to the statistician or unavailable for data in many public gene expression repositories (Edgar et al., 2002). In addition, simply assessing the presence of column correlations in the presence of row correlations is a significant challenge (Efron, 2009; Muralidharan, 2010). Not surprisingly, then, relatively little work has been done on modeling and correcting for the effects of both row and column correlations in the context of large-scale inference. In this paper, we propose to study and develop methodology to correct for row and column correlations when conducting inference on matrix data.

Fig. 1. Histogram of two-sample T-statistics for the Spielman et al. (2007) microarray data. The theoretical null, t_(140), is superimposed, and the T-statistics are over-dispersed compared to the null distribution.

As the effect of the latter is less developed in the literature, we focus more on the behavior of common test statistics and null distributions when there are unanticipated column correlations. We show that even though many have noted that over-dispersion of test statistics can result from correlated tests (Efron, 2004, 2010), unanticipated correlations among the columns can also lead to this result. The main contribution of this paper is a novel procedure to de-correlate both the rows and columns of the data matrix prior to conducting inference. Several have proposed such methods in the context of only row correlations (Tibshirani and Wasserman, 2006; Lai, 2008; Zuber and Strimmer, 2009) or for latent variable models (Leek and Storey, 2008), but none for scenarios in which both the rows and columns are correlated with arbitrary structure. This may be surprising, as the idea seems like a simple and logical first step in tackling the problems arising from strongly correlated matrix data. It turns out, however, that estimating both row and column covariance matrices from a single matrix of data is a major challenge (Efron, 2009; Muralidharan, 2010). We model separable row and column covariances via the matrix-variate normal, which assumes that the covariance between elements of the data matrix is given by the Kronecker product of its column and row covariances (Gupta and Nagar, 1999). This matrix-variate or Kronecker product model has been used by others in the context of microarray data (Efron, 2009; Teng and Huang, 2009), but not for the challenging problem of directly estimating row and column covariances. In prior work, however, we have developed a method of simultaneously estimating row and column covariances via a penalized likelihood approach (Allen and Tibshirani, 2010). In this paper, we introduce a novel procedure that uses these covariance estimates to de-correlate or sphere the noise in the data matrix without changing the underlying signal. This sphered data, which has approximately independent rows and columns, can then be used to conduct large-scale inference. Our approach has several advantages: (i) tests are re-ordered, leading to a better ranking, a lower true false discovery proportion, and greater statistical power; (ii) estimates of the false discovery proportion are more consistent; (iii) the variance of the estimated false discovery rate is reduced.

The paper is organized as follows. In Section 2 we introduce our matrix model based on the mean-restricted matrix-variate normal distribution. We then study the behavior of test statistics when the columns are correlated (Section 2.2) and illustrate how common problems associated with microarray data can lead to unanticipated array (column) correlations (Section 2.3). In Section 3, we develop our main sphering algorithm. Results on both simulated models and real microarray data are given in Section 4, and we conclude with a discussion of our work in Section 5.

2. Framework: A Matrix-variate Model

We present a matrix decomposition model based on the matrix-variate normal distribution. Using this model, we study the effects of unanticipated column correlations on common two-sample test statistics, showing that non-zero column correlations lead to an over- or under-dispersion of the null distribution.

2.1. Matrix-variate Model

We propose to study row and column correlations through a simple matrix decomposition model based on the matrix-variate normal. We motivate the use of this distribution through the example of microarray data. With this data, the genes are often assumed to follow a multivariate normal distribution with the arrays independent and identically distributed. Since we aim to study the effects of array correlations, we need a parametric model that has the flexibility to model either array independence or various array correlation structures. To this end, we turn to the mean-restricted matrix-variate normal introduced in Allen and Tibshirani (2010), a variation of the familiar matrix-variate normal (Gupta and Nagar, 1999). For data X ∈ R^{m×n}, this distribution is denoted X ~ N_{m,n}(ν, µ, Σ, Δ) and has separate mean and covariance parameters for the rows, ν ∈ R^m and Σ ∈ R^{m×m}, and for the columns, µ ∈ R^n and Δ ∈ R^{n×n}. Thus, we can model array correlations directly through the column covariance matrix, Δ. If the data matrix is transformed into a vector of length mn, we have that vec(X) ~ N(vec(M), Ω), where M = ν 1_{(n)}^T + 1_{(m)} µ^T and Ω = Δ ⊗ Σ. Also, the commonly used multivariate normal is a special case of the distribution: if Δ = I and µ = 0, then the columns of X are independently distributed as N(ν, Σ). In fact, all marginal models of the matrix-variate normal are multivariate normal, meaning that both the genes and the arrays separately are multivariate normal. Further properties of this distribution are given in Allen and Tibshirani (2010).

In our matrix decomposition model, we assume that there is an additional signal beyond the row and column means. We then decompose the data into a mean, signal, and correlated noise matrix as follows:

X_{m×n} = M_{m×n} + S_{m×n} + N_{m×n}.   (1)

Here, M = ν 1_{(n)}^T + 1_{(m)} µ^T is the mean matrix, S is the problem-specific signal matrix, and N ~ N_{m,n}(0, 0, Σ, Δ) is the noise matrix. Thus, X − S ~ N_{m,n}(ν, µ, Σ, Δ), meaning that after removing the signal, the data follows a mean-restricted matrix-variate normal distribution. With two-class microarray data, for example, the signal matrix captures the class means. Let there be n_1 arrays in class one, indices denoted by C_1, and n_2 in class two, denoted by C_2. (For simplicity of notation, we assume that the first n_1 arrays are in class one and the last n_2 arrays are in class two.) Let the class signals be ψ_1 ∈ R^m and ψ_2 ∈ R^m. Then, the signal matrix, S, can be written as S = [ψ_1 1_{(n_1)}^T  ψ_2 1_{(n_2)}^T]. Notice that our matrix decomposition model is similar in spirit to the latent variable model of Leek and Storey (2008), also proposed in the context of large-scale inference.

There are several further remarks to make regarding this model. Prior to analyzing data, it is common to standardize the rows. Some have proposed to doubly standardize this two-way data by iteratively scaling both the rows and columns (Efron, 2009; Olshen and Rajaratnam, 2010). With our model, we center both the rows and columns through the mean matrix M, but do not directly scale them. Instead, we allow the diagonals of the covariance matrices of the rows, Σ, and columns, Δ, to capture the differences in variabilities.
Thus, our model keeps the mean and variances separate in the estimation process.
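To make the decomposition in (1) concrete, the following sketch simulates a small two-class data set X = M + S + N with matrix-variate noise generated as N = Σ^{1/2} Z Δ^{1/2} for i.i.d. standard normal Z. This is an illustration only, not code from the paper; the dimensions, signal strength, and AR(1) covariances are hypothetical choices.

import numpy as np

rng = np.random.default_rng(0)
m, n1, n2 = 100, 25, 25            # rows (genes) and columns per class; illustrative sizes
n = n1 + n2

# Row and column means forming the mean matrix M, and a two-class signal S
nu = rng.normal(size=m)
mu = rng.normal(size=n)
M = np.outer(nu, np.ones(n)) + np.outer(np.ones(m), mu)
psi1, psi2 = np.zeros(m), np.zeros(m)
psi1[:10], psi2[:10] = 0.5, -0.5   # first 10 rows are non-null
S = np.hstack([np.outer(psi1, np.ones(n1)), np.outer(psi2, np.ones(n2))])

# Separable AR(1) row and column covariances Sigma (m x m) and Delta (n x n)
Sigma = 0.5 ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
Delta = 0.9 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

def msqrt(A):
    """Symmetric matrix square root via the eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.sqrt(w)) @ V.T

# Matrix-variate noise: N = Sigma^{1/2} Z Delta^{1/2}, so Cov(vec N) = Delta (x) Sigma
N = msqrt(Sigma) @ rng.standard_normal((m, n)) @ msqrt(Delta)
X = M + S + N                      # the matrix decomposition model (1)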

2.2. Null Distributions: The Two-Class Problem

We study the effect of column correlations on the theoretical null distribution of two-sample test statistics computed for a single row of the data matrix. More specifically, we calculate the distributions of test statistics under our matrix decomposition model instead of the typical two-sample framework where samples are drawn independently from two populations. In the familiar two-class inference problem, we have a vector x = [x_1^T x_2^T]^T, with x_1 of length n_1 and x_2 of length n_2, where the elements of each vector are x_{1,i} ~ N(ψ_1, σ^2) and x_{2,i} ~ N(ψ_2, σ^2). We wish to test whether there is a shift in means between the two classes, namely H_0: ψ_1 = ψ_2 vs. H_1: ψ_1 ≠ ψ_2. Here, we assume that the variances, σ^2, are equal between the two classes, but note that analogous results are obtainable if the variances are assumed to be unequal, as is often the case with microarray data. If the variance, σ^2, is known, we have the familiar two-sample Z-statistic, Z = (x̄_1 − x̄_2)/(σ √c_n), which follows the distribution Z ~ N((ψ_1 − ψ_2)/(σ √c_n), 1), where x̄_k = (1/n_k) Σ_{i=1}^{n_k} x_{k,i} and c_n = 1/n_1 + 1/n_2 (Lehmann and Romano, 2005). Going back to our matrix decomposition model, we wish to know the distribution of this Z-statistic for each row when there are column correlations:

Claim 1. Let x^T = [x_1^T x_2^T] ~ N_{1,n}(0, [ψ_1 1_{(n_1)}^T  ψ_2 1_{(n_2)}^T], σ^2, Δ). Then, Z ~ N((ψ_1 − ψ_2)/(σ √c_n), η/c_n), where η = Σ_{i=1}^{n} Σ_{j=1}^{n} Δ_{ij} W_i W_j = W^T Δ W, with W_i = 1/n_1 if i ∈ C_1 and W_i = −1/n_2 if i ∈ C_2.

Thus, when the columns are correlated, the variance of the two-sample Z-statistic is inflated or deflated by η. In terms of the matrix decomposition model, the assumptions of Claim 1 correspond to a row vector that has previously been centered by ν and µ, has signal [ψ_1 1_{(n_1)}^T  ψ_2 1_{(n_2)}^T], column covariance Δ, and row variance σ^2, the diagonal element of Σ. Note that the separability of the covariance matrices of our model allows us to dissect the effects of column correlations in this manner. Notice that if Δ = I, then η = c_n and the variance of Z is one, as desired. If there is only column correlation within the two classes, then the effects of these correlations can be parsed as follows:

Corollary 1. Let x_1^T ~ N_{1,n_1}(0, ψ_1 1_{(n_1)}^T, σ^2, Δ_1) independent of x_2^T ~ N_{1,n_2}(0, ψ_2 1_{(n_2)}^T, σ^2, Δ_2). Then, Z ~ N((ψ_1 − ψ_2)/(σ √c_n), (η_1 + η_2)/c_n), where η_k = (1/n_k^2) Σ_{i=1}^{n_k} Σ_{j=1}^{n_k} Δ_{k,ij} for k = 1, 2.

These effects are explored numerically in a small study described below. We have assumed that the row variance, σ^2, was known; however, in most microarray experiments this is not known and must be estimated. With σ^2 unknown, the two-sample T-statistic is used: T = (x̄_1 − x̄_2)/(s_{x_1,x_2} √c_n), where s^2_{x_1,x_2} is the pooled estimate of the sample variance. Under the null hypothesis, T ~ t_{(n−2)}, while under the alternative, T ~ t^{(δ)}_{(n−2)}, a non-central t distribution with non-centrality parameter δ = (ψ_1 − ψ_2)/(σ √c_n) (Lehmann and Romano, 2005). When there are column correlations as in the assumptions of Claim 1, however, the distribution of T does not have a closed form. (The square of the pooled sample standard deviation is no longer distributed as a Chi-squared random variable, and the numerator and denominator of T are not independent.) Hence, we explore the effects of column correlations on the T-statistic through a small simulation study. Data is simulated according to the assumptions of Claim 1 with n = 50 columns and n_1 = n_2 = 25 in each class.
Four structured covariance matrices were used to assess the variances of the Z- and T-statistics: Δ_1 with Δ_{1,ij} = 0.9^{|i−j|}; Δ_2 block diagonal with blocks of size 10 and, within each block, Δ_{2,ij} = 0.9^{|i−j|}; Δ_3 with Δ_{3,ij} = 0.5^{|i−j|}; and Δ_4 block diagonal with blocks of size 10 and, within each block, Δ_{4,ij} = 0.5^{|i−j|}.

Positive correlation structures are used, as most observed array covariances in microarray studies are positive. We note that with negative correlations among the columns, η < c_n, resulting in under-dispersed null distributions.

Fig. 2. Comparison of theoretical null distributions for the two-sample Z-statistic (left) and T-statistic (right) under the four column correlation scenarios given in Section 2.2. Variances of the Z-statistics were calculated by the result in Claim 1, while the densities of the T-statistics were estimated via Monte Carlo simulation.

Fig. 3. Variances of the two-sample Z- and T-statistics under the four column correlation scenarios. The theoretical variance of the Z-statistic should be 1, and that of the T-statistic should be the variance of the t_(48) distribution.

Figure 2 demonstrates the effect of column correlations on the distributions of Z and T. We see that positive column correlations can cause dramatic over-dispersion of the test statistics compared to their theoretical null distribution. This is a possible explanation for the over-dispersion seen in the real microarray example displayed in Figure 1. Compared to the variance of the Z-statistic, the T-statistic is even more affected by column correlations. This is confirmed in Figure 3, where we present the variances of the Z-statistic calculated by Claim 1 and the variances of the T-statistic estimated by Monte Carlo simulation. Indeed, small amounts of correlation in the columns can cause a dramatic increase in the variance of the T-statistic.

We have shown how the distributions of the T- and Z-statistics behave when the columns or arrays are correlated. When analyzing microarrays, however, many have advocated using non-parametric null distributions estimated by permuting the class labels (Dudoit et al., 2003; Storey and Tibshirani, 2003; Tusher et al., 2001). This approach is also problematic, however, when the columns or arrays are not independent. Since, under the null hypothesis, the joint distribution of the columns is not invariant under permutations, the randomization hypothesis fails (Lehmann and Romano, 2005). Therefore, inferences drawn by using permutation null distributions instead of theoretical nulls suffer from the same troublesome effects of unanticipated column correlations.
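As a quick numerical check of Claim 1 (a sketch, not the authors' code; the AR(1) column covariance and class sizes are illustrative), one can compare the derived variance η/c_n of the two-sample Z-statistic for a single null row against a Monte Carlo estimate:

import numpy as np

rng = np.random.default_rng(1)
n1 = n2 = 25
n = n1 + n2
cn = 1.0 / n1 + 1.0 / n2

# AR(1) column covariance, as in scenario Delta_1 above
Delta = 0.9 ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))

# Contrast vector W from Claim 1: 1/n1 for class one, -1/n2 for class two
W = np.concatenate([np.full(n1, 1.0 / n1), np.full(n2, -1.0 / n2)])
eta = W @ Delta @ W
print("derived Var(Z) = eta / c_n =", eta / cn)

# Monte Carlo: null rows x ~ N(0, Delta) (sigma = 1), Z = (xbar1 - xbar2) / sqrt(c_n)
L = np.linalg.cholesky(Delta)
x = (L @ rng.standard_normal((n, 100_000))).T
Z = (x[:, :n1].mean(axis=1) - x[:, n1:].mean(axis=1)) / np.sqrt(cn)
print("Monte Carlo Var(Z) =", Z.var())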

Our brief study of the behavior of two-sample test statistics in the presence of unanticipated column correlations reveals several problematic behaviors. Relatively small column correlations can have a large effect, leading to over- or under-dispersion of the theoretical null distribution and incorrect inferences. In the supplementary materials, we use our matrix-variate model to study the performance of large-scale inference methodology (in particular, the step-up procedure (Benjamini and Hochberg, 1995), the permutation procedure (Storey and Tibshirani, 2003), and the empirical null based local FDR procedure (Efron, 2007)) under several row and column correlation scenarios. These results reveal that when both the rows and the columns are correlated, these problems are exacerbated. Specifically, further over-dispersion of the null distribution occurs, leading to (i) biased estimates of the FDR and (ii) greater variance of the FDR estimates, whose effect is even greater than when only the rows or only the columns are correlated. Therefore, a troubling picture emerges regarding the statistical perils of performing large-scale inference on a data matrix in which the rows and columns may be correlated.

2.3. Microarrays & Unanticipated Array Correlations

Before continuing with our proposal to solve the problems associated with two-way dependencies and large-scale inference, we pause to understand and quantify some possible sources of unanticipated array correlations in microarray data. We consider models for three possible sources of array correlations in microarray data: a batch-effect model (Li and Rabinovic, 2007; Leek et al., 2010a), a latent variable model (Leek and Storey, 2008), and an instrument drift model (Fare et al., 2003). Clearly, if these sources of array correlations are known to the statistician, they should be modeled directly. Unfortunately, this information is often missing, and we seek to understand the effect of these correlations if they cannot be directly accounted for. Hence, we quantify how array correlations are induced if one fits a standard model assuming that the arrays are independent when they are in fact distributed according to these other models. In other words, we calculate the array covariance resulting from model bias.

Consider a standard model for microarray data assuming that the arrays are independent with Gaussian noise: X_{ij} = S_{ij} + ε_{ij}, where S_{ij} denotes the fixed effects from the signal of interest and ε_{ij} is a random effect, ε_{ij} ~ N(0, 1). Then, the expectation of the cross-products of the population residuals, r^{(S)}_{ij} = X_{ij} − S_{ij} = ε_{ij}, is obviously E(r^{(S)}_{ij} r^{(S)}_{ij'}) = 0 for j ≠ j'. These cross-products are non-zero, however, for the other models we consider.

Let us consider the following batch effects model: X_{ij} = S_{ij} + Σ_{k=1}^{K} β_{ik} I(j ∈ I(k)) + ε_{ij}, where I(k) denotes the batch membership and β_{ik} ~ N(µ_k, σ_k^2) independent of ε_{ij}. Thus, the batch effect is a random effect given by β_{ik}. Defining the population residuals, r^{(B)}_{ij}, in the same manner as above, we see that the expected cross-products are non-zero for arrays in the same batch: E(r^{(B)}_{ij} r^{(B)}_{ij'}) = µ_k^2 + σ_k^2 if (j, j') ∈ I(k), and 0 otherwise. Hence, if either the mean batch effect or the additional variance among arrays in the batch is large, then strong correlations among the arrays can result. Similarly, consider the following latent variable model: X_{ij} = S_{ij} + Σ_{k=1}^{K} Γ_{ik} G_{kj} + ε_{ij}, where G_{kj} is the fixed latent variable with random weights Γ_{ik} ~ N(0, 1) independent of ε_{ij}.
Then, the expected cross-products of the population residuals, r^{(L)}_{ij}, are given by: E(r^{(L)}_{ij} r^{(L)}_{ij'}) = Σ_{k=1}^{K} G_{kj} G_{kj'}. To measure the effect of instrument drift, we employ a random walk with drift model. Define D_{ij} = µ + D_{i,j−1} + ψ_j, with D_{i1} ~ N(0, σ^2) and ψ_j ~ N(0, σ^2) independent of ε_{ij}, and µ the fixed instrument drift. Then, consider the following instrument drift model: X_{ij} = S_{ij} + D_{ij} + ε_{ij}. Again, the expected cross-products of the population residuals, r^{(D)}_{ij}, are: E(r^{(D)}_{ij} r^{(D)}_{ij'}) = (j − 1)(j' − 1) µ^2 + σ^2 (j ∧ j'), where j ∧ j' denotes the minimum of j and j'. Calculations for all of these covariances are given in Appendix A.
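The induced array correlations under, say, the batch-effect model can be verified numerically. The sketch below (illustrative parameter values, not from the paper) compares the empirical cross-product of residuals for two arrays in the same batch with the derived value µ_k^2 + σ_k^2.

import numpy as np

rng = np.random.default_rng(2)
m, K, batch_size = 5000, 5, 10                 # many genes; 5 batches of 10 arrays
n = K * batch_size
batch = np.repeat(np.arange(K), batch_size)    # batch membership of each array

mu_k = np.array([-0.5, -0.25, 0.0, 0.25, 0.5]) # hypothetical mean batch effects
sigma_k = 0.5                                  # hypothetical batch standard deviation

# Residuals r_ij = beta_{i, batch(j)} + eps_ij after removing the true signal S
beta = rng.normal(loc=mu_k, scale=sigma_k, size=(m, K))  # per-gene random batch effects
r = beta[:, batch] + rng.standard_normal((m, n))

# Empirical cross-product for two arrays in batch 0 vs. the derived mu_k^2 + sigma_k^2
print("empirical:", np.mean(r[:, 0] * r[:, 1]))
print("derived  :", mu_k[0] ** 2 + sigma_k ** 2)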

Based on these simple models for microarray data, we see that large correlations among the arrays can be induced by relatively small batch effects, latent variable effects, or instrument drifts if these effects are not explicitly modeled. Putting this together with the results discussed in the previous section, these small non-zero correlations can lead to dramatically wider null distributions of common test statistics. This, in turn, leads to many more genes being rejected than are truly differentially expressed. This illustration then serves as motivation for methods of directly addressing two-way dependencies when conducting large-scale inference.

3. De-Correlating a Matrix Prior to Conducting Inference

We propose a novel method for solving the numerous problems associated with conducting large-scale inference on matrix data in which the rows and columns may be correlated. The approach has three simple steps: (1) estimate the signal in the data and subtract this to get an estimate of the two-way correlated noise; (2) simultaneously estimate separable row and column covariances of the noise and use these estimates to sphere or de-correlate the noise; (3) add the de-correlated noise and the signal to yield the sphered data on which one conducts large-scale inference.

Before we introduce our methodology, we briefly review the results and existing literature illustrating the challenges of estimating both row and column covariances from a single matrix of data. First, the problem of estimating these two separable covariances for multiple instances of independent and identically distributed matrix data has been established by Dutilleul (1999). The number of repeated matrix instances, however, must be large relative to the row and column dimensions. In our case, we have only one replicate of size m × n from which to estimate m(m−1)/2 + n(n−1)/2 parameters. Furthermore, the empirical estimates of the row and column covariances share the same information content. Assume that the data, X, has been row- and column-centered, and decompose the data according to the singular value decomposition, giving X = U D V^T. Then, the empirical covariances, ˆΣ = X X^T / m and ˆΔ = X^T X / n, can be written as ˆΣ = U D^2 U^T / m and ˆΔ = V D^2 V^T / n. That is, the empirical covariances share the same eigenvalues. Efron (2009) goes on to show that the variances of the elements of the two empirical correlation matrices are the same. Muralidharan (2010) likens this problem to estimating the variances of two random variables having only observed their sum.

Given this, there are several important points to discuss. First, if the underlying data is truly matrix-variate, meaning that neither the rows nor the columns are independent, then simply estimating the row covariance or the column covariance is insufficient. This occurs because non-zero column correlations influence the apparent row correlations and vice versa, a point discussed in detail in Efron (2009). If our ultimate goal is to de-correlate the data matrix before conducting inference, then estimating only the row covariance or only the column covariance would lead to erroneous conclusions when the data is truly transposable. Additionally, estimating the row and column covariances separately would lead to these same problems. Finally, the empirical estimates of at least one covariance matrix, and likely both, are necessarily singular. As the inverse covariance matrix is needed to de-correlate the data, this presents an additional challenge.
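The shared information content of the two empirical covariances is easy to see numerically. The following sketch (purely illustrative; it is not part of the TRCM estimation) confirms via the SVD that the non-zero eigenvalues of ˆΣ and ˆΔ are the squared singular values of the centered data, up to the 1/m and 1/n scalings used above.

import numpy as np

rng = np.random.default_rng(3)
m, n = 40, 30
X = rng.standard_normal((m, n))
X = X - X.mean(axis=1, keepdims=True)    # row center
X = X - X.mean(axis=0, keepdims=True)    # column center

U, d, Vt = np.linalg.svd(X, full_matrices=False)
Sigma_hat = X @ X.T / m                  # empirical row covariance (scaling as in the text)
Delta_hat = X.T @ X / n                  # empirical column covariance

eig_Sigma = np.sort(np.linalg.eigvalsh(Sigma_hat))[::-1][: len(d)]
eig_Delta = np.sort(np.linalg.eigvalsh(Delta_hat))[::-1][: len(d)]
# Both sets of non-zero eigenvalues are the squared singular values d^2, rescaled
print(np.allclose(eig_Sigma, d ** 2 / m), np.allclose(eig_Delta, d ** 2 / n))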
We propose to estimate non-singular row and column covariances, ˆΣ and ˆΔ, simultaneously via the Transposable Regularized Covariance Models framework introduced in Allen and Tibshirani (2010). Then, we use these estimates to de-correlate the noise of the data matrix, Ñ = ˆΣ^{−1/2} N ˆΔ^{−1/2}, yielding a new noise matrix, Ñ, which has approximately independent rows and columns.

(We note that if N ~ N_{n,p}(0, 0, Σ, Δ), for example, then Σ^{−1/2} N Δ^{−1/2} ~ N_{n,p}(0, 0, I_{(n)}, I_{(p)}) (Gupta and Nagar, 1999).) Finally, this new noise matrix is added to the signal estimated from the model to conduct large-scale inference.

3.1. Review: Transposable Regularized Covariance Models

The Transposable Regularized Covariance Model (TRCM) allows us to estimate non-singular row and column covariances simultaneously by maximizing a penalized log-likelihood of the matrix-variate normal distribution (Allen and Tibshirani, 2010). The model places a matrix-convex penalty on the inverse covariances, or concentration matrices, of the rows and columns, allowing one to estimate non-singular covariances. For the ultimate purpose of de-correlating a data matrix when conducting inference, we choose to employ an l1-norm penalty on the concentration matrices. This is done for both practical and theoretical reasons, which we will discuss shortly. First, let us review the model, some of its properties, and the algorithms used to find the penalized MLE. Following from the matrix decomposition model, (1), if we let N be the noise matrix remaining after removing the means and the signal in the data, then the penalized log-likelihood is as follows:

l(Σ, Δ) = (n/2) log |Σ^{−1}| + (m/2) log |Δ^{−1}| − (1/2) tr(Σ^{−1} N Δ^{−1} N^T) − λ m ||Σ^{−1}||_1 − λ n ||Δ^{−1}||_1,   (2)

where ||Δ^{−1}||_1 = Σ_{i=1}^{n} Σ_{j=1}^{n} |Δ^{−1}_{ij}| and λ is a penalty parameter controlling the amount of sparsity in the concentration matrices. Notice that the first three terms of (2) are the log-likelihood of the matrix-variate normal distribution (Gupta and Nagar, 1999). Hence, the model assumes that the row and column covariances are separable and that the joint distribution of the noise is vec(N) ~ N(0, Δ ⊗ Σ), that is, the joint covariance is given by the Kronecker product. The l1-norm penalties placed on the concentration matrices are an extension of the graphical lasso-type penalties to the matrix-variate framework (Friedman et al., 2007; Rothman et al., 2008). These penalties encourage zeros in the off-diagonals of the concentration matrices, corresponding to a selection of edges in a graph structure and indicating that the variables are conditionally independent (Dempster, 1972). Allen and Tibshirani (2010) showed that successive application of the graphical lasso algorithm, which solves the sub-gradient equations of (2), converges to a maximum of the penalized log-likelihood. The resulting estimates for ˆΣ^{−1} and ˆΔ^{−1} are necessarily non-singular, as desired.

Before discussing the rationale for our model with l1 penalties, we pause to address the logical question: why not use l2-norm penalties, as also presented in Allen and Tibshirani (2010)? Recall that the l2-norm TRCM solutions for ˆΣ and ˆΔ have the same eigenvectors as their empirical counterparts. The solutions for the eigenvalues are simply regularized versions of the empirical eigenvalues. From results in random matrix theory, we know that the empirical eigenvectors and eigenvalues of covariances can be inconsistent with high-dimensional data (Johnstone, 2001; Johnstone and Lu, 2009). Furthermore, suppose that we were to de-correlate the noise by left and right multiplying by the matrix square roots of ˆΣ^{−1} and ˆΔ^{−1}. Let N = U D V^T be the SVD of the noise, and let Λ_{ˆΣ} and Λ_{ˆΔ} be the diagonal matrices of eigenvalues of ˆΣ and ˆΔ, respectively.
Then, the sphered noise, Ñ, that would result from using the l2 TRCM estimates has the following form:

Ñ = ˆΣ^{−1/2} N ˆΔ^{−1/2} = (U Λ_{ˆΣ}^{−1/2} U^T)(U D V^T)(V Λ_{ˆΔ}^{−1/2} V^T) = U (Λ_{ˆΣ}^{−1/2} D Λ_{ˆΔ}^{−1/2}) V^T.

Hence, using the l2-norm TRCM estimates to de-correlate the noise returns a matrix with the same singular vectors as the original noise. Also, the resulting singular values are simply a regularized version of the original singular values of the noise, a result which can be calculated using the formulas for Λ_{ˆΣ} and Λ_{ˆΔ} in Allen and Tibshirani (2010). Therefore, employing l2-norm TRCM estimates changes the scale of the noise instead of projecting the noise onto directions that yield approximately independent rows and columns.

Using the l1-norm penalty in the TRCM framework, however, has many practical advantages in the context of large-scale inference. First, one usually assumes that the columns are independent, so having Δ = I should be our default position. As the penalty encourages sparsity in the off-diagonals of Δ^{−1}, estimating a diagonal covariance is a special case of this model. Furthermore, notice that the penalty parameter, λ, is modulated by the dimension of the rows and columns. (We note that λ can be estimated via cross-validation using the efficient alternating conditional expectations algorithm (Allen and Tibshirani, 2010).) Thus, the evidence of partial correlations among the columns must be strong relative to the partial correlations among the rows in order for non-zero column correlations to be estimated. Secondly, especially in the context of microarrays, it seems reasonable to assume that the covariance among the genes is sparse, as biologically, genes are likely to be correlated only with genes in the same or related pathways.

There is also a theoretical foundation motivating the use of the l1-norm TRCM estimates. These estimates, by encouraging sparsity, regularize both the eigenvectors and the eigenvalues. This turns out to be important for covariance estimation. For multivariate normal data, covariance estimators resulting from the graphical lasso penalty are consistent in both the Frobenius norm and the operator norm for estimating a true underlying sparse matrix (Rothman et al., 2008). Note that while convergence in the Frobenius norm gives convergence of the eigenvalues, convergence in the operator norm implies convergence of the eigenvectors (El Karoui, 2008). Additionally, regularizing eigenvectors using sparsity has been shown to yield consistent directions for dimension reduction (Johnstone and Lu, 2009). While these results are for multivariate data, we note that both the feature space and the sample space are permitted to increase as long as they increase at a constant ratio asymptotically. Given this, and from our experience with the l1-norm TRCM estimates, we conjecture that, under the right assumptions, one may prove that the ˆΣ and ˆΔ that maximize (2) are consistent for estimating sparse separable inverse covariance matrices of the matrix-variate normal. We leave this open problem for future work.

3.2. Sphering Algorithm

We develop a simple method to directly address the problems associated with inference on data exhibiting row and column correlations: we de-correlate the underlying data before conducting inference. Among the many advantages of this approach are that (i) it can be used with any test statistic and (ii) it can be used with any method of controlling for multiple testing.

Algorithm 1 Sphering Algorithm
(a) Estimate the row and column means, ˆν and ˆµ, forming ˆM, and the signal matrix, Ŝ.
(b) Define the noise, ˆN = X − ˆM − Ŝ. Estimate the row and column covariances of the noise, ˆΣ and ˆΔ, via TRCM.
(c) Sphere the noise: Ñ = ˆΣ^{−1/2} ˆN ˆΔ^{−1/2}. Form the sphered data matrix: X̃ = Ŝ + Ñ.
Our sphering algorithm, based on the matrix decomposition model (1), is given in Algorithm 1. This algorithm simply removes the means and signal, estimates the row and column covariances of the noise, and uses these to de-correlate or sphere the noise.
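A minimal sketch of Algorithm 1 for the two-class problem is given below. Step (b) is a placeholder: for illustration it uses sample covariances shrunk toward the identity to obtain non-singular estimates, whereas the paper prescribes the penalized TRCM estimates of Allen and Tibshirani (2010); the function and parameter names are ours.

import numpy as np

def inv_msqrt(A):
    """Symmetric inverse square root A^{-1/2} = P diag(1/sqrt(lambda)) P^T."""
    w, P = np.linalg.eigh(A)
    return P @ np.diag(1.0 / np.sqrt(w)) @ P.T

def sphere_two_class(X, n1, alpha=0.5):
    """Sketch of Algorithm 1 for the two-class problem.

    The covariance step uses shrinkage toward the identity as a simple
    stand-in for the penalized TRCM estimates; alpha is a hypothetical
    shrinkage weight, not a parameter from the paper."""
    m, n = X.shape
    # (a) row/column means and the two-class signal (class means per row)
    row_means = X.mean(axis=1, keepdims=True)
    col_means = X.mean(axis=0, keepdims=True)
    Xc = X - row_means - col_means + X.mean()
    psi1 = Xc[:, :n1].mean(axis=1, keepdims=True)
    psi2 = Xc[:, n1:].mean(axis=1, keepdims=True)
    S_hat = np.hstack([np.repeat(psi1, n1, axis=1),
                       np.repeat(psi2, n - n1, axis=1)])
    # (b) noise and non-singular (placeholder) row/column covariance estimates
    N_hat = Xc - S_hat
    Sigma_hat = (1 - alpha) * (N_hat @ N_hat.T / n) + alpha * np.eye(m)
    Delta_hat = (1 - alpha) * (N_hat.T @ N_hat / m) + alpha * np.eye(n)
    # (c) sphere the noise and add the signal back to form the sphered data
    N_tilde = inv_msqrt(Sigma_hat) @ N_hat @ inv_msqrt(Delta_hat)
    return S_hat + N_tilde

Two-sample statistics and any multiple testing procedure can then be applied to the returned sphered matrix exactly as they would be to the original data.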

The estimated signal, Ŝ, is problem specific. For the two-class model, for example, Ŝ = [ˆψ_1 1_{(n_1)}^T  ˆψ_2 1_{(n_2)}^T], where ˆψ_1 and ˆψ_2 are the vectors of estimated class means for each row. Note that we use the symmetric square root, defined as follows: let ˆΣ^{−1} = P Λ P^T be the eigenvalue decomposition of ˆΣ^{−1}; then the symmetric matrix square root is given by ˆΣ^{−1/2} = P Λ^{1/2} P^T. Then, adding the signal back into this sphered noise, we obtain X̃, which we call the sphered data. Thus, the sphering algorithm solves the problems associated with row and column correlations by sphering the underlying data. As this algorithm operates on the original data matrix, it can be used with any test statistic and any multiple testing procedure. To better understand the algorithm, however, we investigate some of its properties for the two-class problem:

Proposition 1. Let X ~ N_{m,n}(M + S, Σ, Δ), where M = ν 1_{(n)}^T + 1_{(m)} µ^T and S = [ψ_1 1_{(n_1)}^T  ψ_2 1_{(n_2)}^T], and let X̃ be the sphered data given by Algorithm 1. Then, (i) E(X̃) = S = [ψ_1 1_{(n_1)}^T  ψ_2 1_{(n_2)}^T]; (ii) if, in addition, we take some N_0 equal in distribution to N and independent of N, and define Ñ_0 = ˆΣ^{−1/2} N_0 ˆΔ^{−1/2}, then Ñ_0 | ˆΣ, ˆΔ ~ N_{m,n}(0, 0, Σ̃, Δ̃), where Σ̃ = ˆΣ^{−1/2} Σ ˆΣ^{−1/2} and Δ̃ = ˆΔ^{−1/2} Δ ˆΔ^{−1/2}.

Thus, the signal remains the same between X and X̃, and the covariance structure is all that changes. Each row of X̃ then becomes a linear combination of the other rows weighted by their partial correlations. The same applies to the columns. Now, let us study how sphering the data affects the Z- and T-statistics from Section 2.2. First, the Z-statistic does not change with sphering. The numerator of both the Z- and T-statistics, x̄_{1,i} − x̄_{2,i}, is given by ˆψ_{1,i} − ˆψ_{2,i}, the components of the estimated signal matrix Ŝ. The denominator of the T-statistic, namely s_{x_1,x_2}, the estimate of the noise, however, changes with sphering. Thus, since the denominator of the T-statistic changes with sphering, the ranking of the rows changes as well. This is an important point which we will discuss in more detail subsequently. Recall also that in Section 2.2, we noted that the T-statistic does not have a closed-form distribution when there are column correlations. After sphering the data, however, the T-statistic on the sphered data approximately follows a scaled t distribution under certain conditions:

Claim 2. Assume the assumptions in Proposition 1 (ii) hold. In addition, let X̃ be the sphered data defined by Ñ_0 + Ŝ, and let T̃_i be the statistic for the i-th row of the data X̃. Then, under the null hypothesis H_0: ψ_{1,i} = ψ_{2,i}, if Δ̃ = I, T̃_i is approximately distributed as (σ_i √η)/(σ̃_i √c_n) · t_{(n−2)}, where σ̃_i^2 denotes the i-th diagonal element of Σ̃. Here, c_n and η are defined as in Claim 1.

Using our sphering algorithm to de-correlate the noise in the data matrix, we obtain test statistics that follow approximately known distributions under certain conditions. The sphered column covariance, Δ̃, is assumed to be the identity. If Δ̃ is instead a diagonal matrix, then a simple scaling of the columns will give the above result. Notice that if the original data, X, has no column correlations, Δ = I, then T and T̃ both approximately follow a scaled t distribution with n − 2 degrees of freedom. Thus, if the data originally follows the correct theoretical null distribution, then sphering the data does not change its null distribution, an important property. Also, if the sphered rows are independent, Σ̃ = I, or approximately independent, then the statistics T̃_i are independent or approximately independent.

We also note that we can often assume that σ̃_i = σ_i, thus eliminating that coefficient ratio from the distribution. This is an especially reasonable assumption if the rows are scaled prior to applying the sphering algorithm. While Δ̃ and Σ̃ are not likely to be exactly the identity, we have observed in simulations that these are often diagonal or nearly diagonal.

When calculating p-values for T̃ based on the distribution given in Claim 2, we must know the value of η, which depends on the original column covariance. While one might be inclined to estimate η from ˆΔ, this is problematic for several reasons. First, ˆΔ is the penalized MLE, meaning that the estimate is biased for finite samples, and the exact formula for this bias has not yet been established. Thus, estimating η in this manner would result in a global underestimate of the population variance, η. Secondly, ˆΔ and ˆΣ are only identifiable up to a multiplicative constant (Allen and Tibshirani, 2010). Hence, the scale of the variances of the columns is not separable from that of the rows, meaning that one cannot determine the variance associated with the columns, η, from the TRCM estimates. Further research on the consistency of the estimates ˆΔ and ˆΣ is needed, as well as investigations into estimating η directly.

For our purposes, then, we propose to estimate η, and hence re-scale the distribution of the sphered test statistics, T̃, in a data-driven manner. Note that if all of the test statistics were truly from the null distribution, then we could simply re-scale the test statistics to have the same variance as that of the t_{(n−2)} distribution. As we often expect a portion of the tests to be non-null, however, we do not want these tests to contaminate our variance estimate. Thus, we propose to scale by only the central portion of the observed distribution of test statistics, as these are most likely to be truly null tests. More specifically, we estimate η by comparing the variance of the central portion of the t_{(n−2)} distribution to that of the central portion of the T̃-statistics. This procedure is outlined in Algorithm 2, where ρ_α(x) denotes the α-th quantile of x and I(·) is the indicator function.

Algorithm 2 Scaling by the central portion of T̃
(a) Let the expected proportion of null test statistics be ˆπ_0 = ˆm_0 / m.
(b) Estimate the variance of the central portion of the sphered test statistics: ˆσ^2_{T̃}(ˆπ_0) = Var̂{ T̃_i : ρ_{(1−ˆπ_0)/2}(T̃) ≤ T̃_i ≤ ρ_{(1+ˆπ_0)/2}(T̃) }.
(c) Define the central-matched T-statistics: T̄ = T̃ σ_{t(n−2)}(ˆπ_0) / ˆσ_{T̃}(ˆπ_0), where σ^2_{t(n−2)}(ˆπ_0) is the variance of the central portion of the t_{(n−2)} distribution.

The estimate of the scaling factor η is then ˆη = c_n ˆσ^2_{T̃}(ˆπ_0) / σ^2_{t(n−2)}(ˆπ_0). Thus, the resulting T̄-statistics can be tested against the t_{(n−2)} distribution. As we do not want statistics corresponding to non-null tests to contaminate the variance estimates, we recommend using a conservative estimate of π_0, such as 0.8 or 0.9 for microarrays.

We pause to ask a logical question. Based on the results in Section 2.2, why does one need to sphere the data and then estimate η, instead of simply estimating η via Algorithm 2 at the onset? Recall that the result in Claim 1, giving the altered variance of the Z-statistic, was for a single test associated with a single row of the data matrix. Also, recall that over- or under-dispersion of the test statistics can result from correlation among the rows alone (Qiu et al., 2005; Efron, 2010).
Thus, if the row variables are left un-sphered, then estimating η via central matching will result in a biased estimate that leads to incorrect inferences. As a brief illustration of this effect, we also apply central matching to the original data in the analysis of real microarray data in the next section.
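A sketch of the central-matching rescaling of Algorithm 2 applied to the sphered statistics (illustrative code, not the authors' implementation; the Monte Carlo evaluation of the reference variance and the default pi0 = 0.9 are our choices):

import numpy as np
from scipy import stats

def central_match(T_tilde, df, pi0=0.9, n_ref=200_000, seed=0):
    """Sketch of Algorithm 2: rescale the sphered statistics so that their
    central portion matches the central portion of the t_(df) distribution."""
    lo, hi = np.quantile(T_tilde, [(1 - pi0) / 2, (1 + pi0) / 2])
    sd_center = T_tilde[(T_tilde >= lo) & (T_tilde <= hi)].std()

    # Standard deviation of the central pi0 portion of t_(df),
    # evaluated here by Monte Carlo for simplicity
    ref = stats.t.rvs(df, size=n_ref, random_state=seed)
    rlo, rhi = np.quantile(ref, [(1 - pi0) / 2, (1 + pi0) / 2])
    sd_ref = ref[(ref >= rlo) & (ref <= rhi)].std()

    T_bar = T_tilde * sd_ref / sd_center          # central-matched statistics
    pvals = 2 * stats.t.sf(np.abs(T_bar), df)     # tested against t_(df)
    return T_bar, pvals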

Our sphering algorithm takes a simple but direct approach to the problems associated with correlation and large-scale inference. The noise in the data is de-correlated using the TRCM estimates of the row and column covariances, so that the resulting sphered noise is approximately independent. While we have mainly discussed the distributions of standard two-sample test statistics resulting from our algorithm, we note that, as our approach works with the original data matrix, it is general and can be used in conjunction with any test statistic and any multiple testing procedure.

4. Results

We evaluate the performance of our sphering method through simulations and a real microarray example in which batch effects have been documented.

4.1. Simulations

We test our method of directly accounting for row and column correlations when conducting inference on simulated data sets and compare the results to those of competing methods. Four simulation models are employed for this purpose: 1) our matrix-variate model with correlation structure inspired by that of the Spielman et al. (2007) microarray data, 2) a latent variable model, 3) a batch-effect model, and 4) an instrument drift model. Data of dimension 250 × 50 is simulated according to the following model: X = S + B + N, for signal matrix S, effect matrix B, and noise matrix N. The signal is constant for all four simulation models and is that of a two-class model with 25 columns in each class and 50 non-null rows: S = [ψ_1 1_{(25)}^T  ψ_2 1_{(25)}^T], where ψ_{1,1:25} = 0.5, ψ_{1,26:50} = −0.5, ψ_{2,1:25} = −0.5, ψ_{2,26:50} = 0.5, and ψ_{1,51:250} = ψ_{2,51:250} = 0. For the matrix-variate model, the effect matrix B = 0 and the noise matrix N = Σ^{1/2} Z Δ^{1/2} with Z_{ij} ~ N(0, 1). The row and column covariances are inspired by the correlation observed in the Spielman et al. (2007) data: Σ and Δ are taken as the correlation matrices of 250 randomly sampled genes and 50 columns randomly sampled according to the class labels. The latent variable simulation model is taken from Leek and Storey (2008) and consists of the noise matrix, N_{ij} ~ N(0, 1), and the effect matrix, B = Γ G. Here, the latent variables, G, of dimension 2 × 50, are given by G_{ij} ~ Bern(0.5), and the weights, Γ, of dimension 250 × 2, are given by Γ_{ij} ~ N(0, (0.5)^2). For the batch effect model, B_{ij} = Σ_{k=1}^{K} β_{ik} I(j ∈ I(k)), where I(k) indicates the k-th batch membership and β_{ik} ~ N(µ_k, (0.5)^2). We simulate K = 5 batches with ten members each and µ = [−0.5, −0.25, 0, 0.25, 0.5]. The noise, N = Σ^{1/2} Z, where Σ_{ij} = (0.9)^{|i−j|} and Z_{ij} ~ N(0, 1), independent of the effect matrix. Finally, the effect matrix for the instrument drift model is given by B_{i,j} = µ + γ B_{i,j−1} + Z_{i,j}, where the drift µ = 0.01, the shrinkage γ = 0.1, and the innovations Z_{ij} ~ N(0, (0.1)^2). As with the batch effect model, the noise N = Σ^{1/2} Z.

We compare the results of our sphering algorithm to those of competing methods, specifically to standard methodology (row variables are standardized and column variables are centered), surrogate variable analysis (Leek and Storey, 2008), the correlation sharing method (Tibshirani and Wasserman, 2006), the correlation predicted method (Lai, 2008), and the correlation adjusted method (Zuber and Strimmer, 2009). Note that all of these methods re-order the ranks of the row variables compared to those of standard methodology. This means that the true false discovery proportion (FDP) will change for each method depending on whether the true non-null rows are re-ordered correctly.
Thus, we compare results by fixing the number of tests rejected and comparing the true FDP and the estimated FDR, estimated via the step-up procedure (Benjamini and Hochberg, 1995).
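For reference, the quantities compared in this study can be computed as in the sketch below (a standard construction, not code from the paper): for a fixed number of rejections k, one simple step-up-style estimate of the FDR is m p_(k)/k, and the true FDP uses the known non-null rows.

import numpy as np

def fdr_estimate_at_k(pvals, k):
    """Step-up-style FDR estimate when the k smallest p-values are rejected:
    m * p_(k) / k, capped at 1."""
    m = len(pvals)
    p_sorted = np.sort(pvals)
    return min(1.0, m * p_sorted[k - 1] / k)

def true_fdp(rejected_idx, nonnull_idx):
    """True false discovery proportion given the known set of non-null rows."""
    rejected, nonnull = set(rejected_idx), set(nonnull_idx)
    return len(rejected - nonnull) / max(len(rejected), 1)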

Fig. 4. Simulation results comparing the average true false discovery proportion (FDP) to the average estimated false discovery rate (FDR) over 100 replicates of the four simulation models (matrix-variate, latent variable, batch effect, and instrument drift) for the six methods described in Section 4.1 (Standard, Sphered, SVA, Correlation Sharing, Correlation Predicted, and Correlation Adjusted). Results are compared when the number of rejected tests is fixed between 40 and 60 out of 250 total tests, with 50 tests being truly non-null. Rejecting 55 tests corresponds to controlling the oracle FDP at 10%. Best performing methods for each model, meaning that the most truly non-null tests are rejected while still controlling the true FDP, are denoted in bold.

Best performing methods will re-order the rows such that the true FDP is lower, meaning that the statistical power is higher, while the FDR is well estimated or slightly conservative. Methods yielding anti-conservative FDR estimates, meaning that the FDR under-estimates the true FDP, exhibit problematic tendencies, as controlling the FDR does not imply controlling the proportion of false discoveries.

In Figure 4, we display the average true FDP and estimated FDR for the six comparison methods on the four simulation models over 100 replicates for fixed numbers of rejected tests. As controlling the oracle FDP at 10% corresponds to rejecting 55 tests, we present boxplots of the true FDP and estimated FDR in Figure 5 when 55 tests are rejected. These results reveal that the sphering algorithm performs well in comparison to the other five methods. In the matrix-variate, batch effect, and instrument drift models, the sphering algorithm re-orders the row rankings in such a way as to lower the true FDP, yielding an increase in statistical power. In addition, the estimated FDR is a more consistent estimate when sphering is used, allowing one to reject more truly non-null rows than other methods. Notice also that sphering decreases the variance of the FDR estimates compared to standard methodology. The correlation adjusted method (Zuber and Strimmer, 2009) generally results in a favorable re-ordering of the row rankings, but leads to an FDR estimate with larger variance. All competing methods exhibit troubling behavior in the matrix-variate simulation. Here, all methods estimate that there are no false discoveries when in fact there are at least five false discoveries. Also, competing methods such as SVA, correlation sharing, and correlation predicted exhibit these problematic behaviors in at least two of the simulation models. In microarray analysis, this behavior would lead to identifying too many genes as significant when many are likely to be false discoveries. Overall, our sphering algorithm is the most consistent and most robust method for conducting inference on matrix data with both row and column correlations.

4.2. Results: Real Microarray Study

We compare the performance of our sphering algorithm to that of competing methods on a real microarray data set in which strong batch effects have been documented. The data presented in Spielman et al. (2007) measure the gene expression of 4,167 genes for 142 subjects, with 60 of European ancestry (CEU) and 82 of Asian (ASN) ancestry. Spielman et al. (2007) find that 78% of genes are differentially expressed between the CEU and ASN groups. Subsequent work, however, has questioned the validity of these results due to strong batch effects (Akey et al., 2007; Leek et al., 2010a). Specifically, the microarrays were processed between 2003 and 2007, with the bulk (44/60) of the CEU group and all of the ASN group processed in largely non-overlapping sets of years. Due to the strong batch effects measured by the processing year and the confounding between these batches and the two classes, it is difficult to determine the set of truly differentially expressed genes. In fact, after removing the batch effects, no genes are found to be significant (Akey et al., 2007). On this data, we apply our sphering algorithm, standard methodology, and the four competing methods described in the previous section to determine the number of genes that are significantly differentially expressed between the two groups. Prior to analysis, the genes and arrays are centered.
Two-sample t-statistics with the variance correction proposed in Tusher et al. (2001) are used. Test statistics are compared to the permutation distribution (Storey and Tibshirani, 2003) to assess significance and correct for multiple testing. The FDR is controlled at 10%. While the batches are known for this data set, we ignore this information when comparing methods in order to test the performance of each method.
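The testing pipeline used here can be sketched as follows (illustrative only: the additive constant s0 below stands in for the variance correction of Tusher et al. (2001), and the pooled permutation null follows the spirit of Storey and Tibshirani (2003); the actual analysis applies these to each method's processed data matrix):

import numpy as np

def mod_t(X, labels, s0=0.1):
    """Two-sample statistics per row with an additive constant s0 in the
    denominator, in the spirit of the Tusher et al. (2001) variance correction."""
    g1, g2 = X[:, labels == 0], X[:, labels == 1]
    n1, n2 = g1.shape[1], g2.shape[1]
    sp2 = ((n1 - 1) * g1.var(axis=1, ddof=1) +
           (n2 - 1) * g2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))
    return (g1.mean(axis=1) - g2.mean(axis=1)) / (se + s0)

def permutation_pvalues(X, labels, n_perm=1000, seed=0):
    """Two-sided p-values from a null built by permuting the class labels and
    pooling the permuted statistics across rows."""
    rng = np.random.default_rng(seed)
    obs = np.abs(mod_t(X, labels))
    null = np.concatenate([np.abs(mod_t(X, rng.permutation(labels)))
                           for _ in range(n_perm)])
    null_sorted = np.sort(null)
    exceed = len(null_sorted) - np.searchsorted(null_sorted, obs, side="left")
    return (exceed + 1) / (len(null_sorted) + 1)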

Fig. 5. Boxplots of the true false discovery proportion (FDP) and the estimated false discovery rate (FDR) for the six methods under the four simulation models (matrix-variate, latent variable, batch effect, and instrument drift), each repeated 100 times. For each method, the number of tests rejected is fixed at 55 out of 250 tests, corresponding to controlling the oracle FDP at 10%, as shown with the dotted line. Methods and simulation models are described in Section 4.1. Best performing methods re-order the tests such that the true FDP is close to the oracle FDP, while the estimated FDR is a good or slightly conservative estimate of the true FDP with small variance. The best methods are then our sphering method for the matrix-variate, batch effect, and instrument drift models, and the SVA method for the latent variable model.

Number of Significant Genes
Standard: 2040
Standard (with Central Matching): 80
Standard after removing Batch Effects: 0
SVA: 2787
Correlation Sharing: 4167
Correlation Predicted: 248
Correlation Adjusted: 178
Sphering: 28

Fig. 6. Number of significant genes found in the Spielman et al. (2007) data. Two-sample T-statistics with a variance adjustment were used, and the FDR was controlled at 10%. When the batch labels are known and the batch effects removed, no genes are found significant. Assuming the batch labels are unknown, competing methods find many more genes significant. Our sphering method appropriately estimates and removes the effect of the batches, finding only 28 genes significant.

In Figure 6, we present the number of significant genes found by each method. We find that 49% of genes are differentially expressed using the standard methodology, but after removing the batch effects by standardizing the arrays with respect to processing year, no genes are found to be significant. These results are consistent with previous re-analyses of this data (Akey et al., 2007). When employing the surrogate variable analysis and correlation sharing methods, however, even more genes are found to be significant, at 67% and 100%, respectively. The correlation predicted and correlation adjusted methods find 6% and 4% of genes significant, while our sphering algorithm finds only 0.67% of genes significant. Notice also that while central matching adjusts for some of the effects of correlation, since it does not re-order the gene rankings, it still finds 1.9% of genes significant. Thus, even without using knowledge of the processing years, the sphering method correctly adjusts for these batch effects, finding very few genes differentially expressed. The SVA and correlation sharing methods, however, display troubling behavior, as they estimate that even more genes are significant than the standard methodology. These real data results are consistent with what we previously observed in the simulation study.

To explore these results further, we present heatmap dendrograms of the top 250 genes for the original data, the data resulting from the SVA method, and our sphered data in Figure 7. Colorbars indicating the processing years as well as the group labels are shown with these heatmaps. From this, we see that the arrays in the original data cluster by processing year instead of by group status. When the SVA method is applied, this effect appears to be exacerbated, as the separation between processing years is more pronounced. In contrast, the sphered data clusters the arrays by group status and not by processing year. In addition, the arrays do not cluster by any grouping of the processing years, indicating that our method removed these strong batch effects. Instead, the true problem is clearly illustrated, namely that there is confounding between processing year and group status. These results on a real microarray example provide strong evidence for the utility of our sphering approach, revealing that the technique outperforms all competing methods.

5. Discussion

In this paper, we have demonstrated that using standard statistical methodology to conduct inference on transposable data is problematic. As a method of solving these problems, we have proposed a sphering algorithm that de-correlates the data, yielding approximately independent rows and columns, to be used before conducting large-scale inference. We have demonstrated the advantages and robustness of this method through simulations on many correlated data sets.

Fig. 7. Heatmap dendrograms of the top 250 genes for the original standardized Spielman et al. (2007) data (left), the data after removing latent variables via the SVA method (center), and the data after sphering (right). Samples are labeled according to the year in which the arrays were processed (upper colorbar) and the class labels (lower colorbar). In the original data and with the SVA method, samples cluster largely according to processing year. Samples in the sphered data do not seem to cluster according to processing year; instead, confounding of the class labels and processing years is clearly seen.
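For readers who wish to reproduce a display like Figure 7, the following is a minimal sketch of hierarchically clustering the arrays of a genes-by-arrays matrix (original, SVA-adjusted, or sphered). The choice of Ward linkage and of selecting the top genes by variance are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

def cluster_arrays(X, top_k=250):
    """Cluster the arrays (columns) of X using the top_k most variable genes (rows),
    returning the reordered submatrix and the column order for a heatmap display."""
    var_order = np.argsort(-X.var(axis=1))        # most variable genes first
    X_top = X[var_order[:top_k], :]
    Z = linkage(X_top.T, method="ward")           # hierarchical clustering of the arrays
    order = dendrogram(Z, no_plot=True)["leaves"] # leaf order gives the column ordering
    return X_top[:, order], order

# Usage with any genes-by-arrays matrix; here a random toy matrix
rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 60))
heatmap_data, column_order = cluster_arrays(X)
```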
