arxiv: v1 [stat.ot] 7 Jul 2010

Hotelling's test for highly correlated data

P. Bubeliny
e-mail: bubeliny@karlin.mff.cuni.cz
Charles University, Faculty of Mathematics and Physics, KPMS, Sokolovska 83, Prague, Czech Republic

Abstract: This paper is motivated by the analysis of gene expression sets, especially by finding differentially expressed gene sets between two phenotypes. Gene $\log_2$ expression levels are highly correlated and, very likely, have an approximately normal distribution. Therefore, it seems reasonable to use the two-sample Hotelling's test for such data. We discover some unexpected properties of the test that make it different from the majority of tests previously used for such data. It appears that Hotelling's test does not always reach maximal power when all marginal distributions are differentially expressed. For highly correlated data its maximal power is attained when about a half of the marginal distributions are essentially different. When the correlation coefficient is greater than 0.5, this test is more powerful if only one marginal distribution is shifted than if all marginal distributions are equally shifted. Moreover, when the correlation coefficient increases, the power of Hotelling's test increases as well.

1 Introduction

In many situations statisticians need to test multidimensional hypotheses. In a lot of cases the components of the observed random vectors are highly dependent, which may change the properties of the tests used. One example of such data is provided by gene expression levels. Gene expressions are highly correlated between genes (see for example Klebanov and Yakovlev (2007)). Moreover, genes are often investigated not just separately, but also as a set of dependent genes. Therefore one has to deal with multidimensional hypotheses in order to test whether gene sets are differentially expressed. The most popular tests for gene sets are Hotelling's test, the N-test and tests derived from marginal t-statistics.
Ackermann and Strimmer (2009) and Glazko and Emmert-Streib (2009) compared these tests in various situations. Our goal is not to make another comparison, but rather to describe some interesting properties of Hotelling's test which seem to be unexpected.

2 Hotelling's test

One of the most well known tests is the t-test. Hotelling's test is a multidimensional extension of the t-test. Similar to the t-test, we can consider both one-sample and two-sample Hotelling's tests. The one-sample case deals with the hypothesis that the expected value of a sample from a multidimensional normal distribution is equal to some given vector. The two-sample case deals with the hypothesis of the equality of the expected values of two samples from multidimensional normal distributions (with equal covariance structure). In this paper we focus on the two-sample Hotelling's test.

Suppose we have two independent samples (of sizes $n_x$ and $n_y$, respectively) from two n-dimensional normal distributions with identical covariance matrices equal to $\Sigma$. In other words, we consider $X_1, \dots, X_{n_x}$ as i.i.d. random vectors having the $N_n(\mu_x, \Sigma)$ distribution and $Y_1, \dots, Y_{n_y}$ as i.i.d. random vectors having the $N_n(\mu_y, \Sigma)$ distribution ($X_i$ and $Y_j$ are independent for all $i = 1, \dots, n_x$; $j = 1, \dots, n_y$). For simplicity we assume that $n < n_x + n_y - 1$. Our goal is to test the hypothesis $H: \mu_x = \mu_y$ against the alternative $A: \mu_x \neq \mu_y$. For this we use Hotelling's test based on the statistic

$$T^2 = \frac{n_x n_y}{n_x + n_y} (\bar X - \bar Y)^T S^{-1} (\bar X - \bar Y), \qquad (1)$$

where

$$\bar X = \frac{1}{n_x} \sum_{i=1}^{n_x} X_i; \qquad \bar Y = \frac{1}{n_y} \sum_{i=1}^{n_y} Y_i \qquad \text{and} \qquad S = \frac{\sum_{i=1}^{n_x} (X_i - \bar X)(X_i - \bar X)^T + \sum_{i=1}^{n_y} (Y_i - \bar Y)(Y_i - \bar Y)^T}{n_x + n_y - 2}.$$

The statistic $T^2$ is related to the F-distribution by

$$\frac{n_x + n_y - n - 1}{n (n_x + n_y - 2)} T^2 \sim F(n, n_x + n_y - n - 1). \qquad (2)$$

For more details about Hotelling's test see, for example, Chatfield and Collins (1980). We made the assumption $n < n_x + n_y - 1$ for two reasons. For $n \geq n_x + n_y - 1$ the estimate $S$ of $\Sigma$ is a singular matrix, so that $S^{-1}$ does not exist; moreover, the numerator of the factor in (2) is non-positive, as is the second degree of freedom of the F-distribution. In such situations it is possible to use some pseudo-inverse of $S$, and in order to estimate the p-value of $H$ we can use permutations of $(X_1, \dots, X_{n_x}, Y_1, \dots, Y_{n_y})$.

3 Hotelling's test for strongly dependent data

As mentioned above, genes are highly dependent, and we will suppose that their $\log_2$ expression levels have approximately normal distributions. Many papers work with gene sets (for example Barry et al. (2008)) instead of genes alone and therefore deal with multidimensional hypotheses. It seems reasonable to use Hotelling's test in this situation. Assume that we have two multidimensional samples and need to test the hypothesis of the equality of the expected values in these two samples. Assume for simplicity that all elements on the main diagonal of the covariance matrix $\Sigma$ for both samples are equal to 1 and all other elements are equal to $\rho > 0$, i.e.

$$\Sigma = \begin{pmatrix} 1 & \rho & \cdots & \rho \\ \rho & 1 & \cdots & \rho \\ \vdots & & \ddots & \vdots \\ \rho & \rho & \cdots & 1 \end{pmatrix}.$$

Further on, we assume that $\mu_x = (0, \dots, 0)^T$, but $\mu_y$ has its first $m$ elements equal to 1 and the others equal to 0, i.e.

$$\mu_y = (\underbrace{1, \dots, 1}_{m}, \underbrace{0, \dots, 0}_{n-m})^T.$$

For large $n_x$ and $n_y$ the matrix $\Sigma$ and its estimate $S$ are approximately the same, as are the difference between the expected values $(\mu_x - \mu_y)$ and the difference between the mean values $(\bar X - \bar Y)$. When dealing with real data, $n_x$ and $n_y$ might not be large enough, but for theoretical reasons we may use the approximations $S \approx \Sigma$ and $\bar X - \bar Y \approx \mu_x - \mu_y$. In this case $S^{-1} \approx \Sigma^{-1}$, that is

$$S^{-1} \approx \Sigma^{-1} = \begin{pmatrix} \alpha & -\beta & \cdots & -\beta \\ -\beta & \alpha & \cdots & -\beta \\ \vdots & & \ddots & \vdots \\ -\beta & -\beta & \cdots & \alpha \end{pmatrix},$$

where

$$\alpha = \frac{1 + (n-2)\rho}{(1-\rho)(1 + (n-1)\rho)} \qquad \text{and} \qquad \beta = \frac{\rho}{(1-\rho)(1 + (n-1)\rho)}.$$

For fixed $n_x$ and $n_y$ we can consider the fraction $\frac{n_x n_y}{n_x + n_y} = k$ in Hotelling's statistic (1) as a normalizing constant. Let us denote by $\tilde T^2$ Hotelling's statistic with $\Sigma$ instead of $S$ and $\mu_x - \mu_y$ instead of $\bar X - \bar Y$, divided by the constant $k$. Therefore, we have

$$T^2 / k \approx \tilde T^2 = (\mu_x - \mu_y)^T \Sigma^{-1} (\mu_x - \mu_y).$$
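The closed form of the inverse covariance matrix can be checked numerically. The following is a minimal sketch in plain Python (the function names are ours and purely illustrative): it builds the equicorrelation matrix, builds the claimed inverse with $\alpha$ on the diagonal and $-\beta$ elsewhere, and measures how far their product is from the identity.

```python
def equicorrelation(n, rho):
    """Covariance matrix with ones on the diagonal and rho elsewhere."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


def equicorrelation_inverse(n, rho):
    """Claimed closed-form inverse: alpha on the diagonal, -beta elsewhere."""
    denom = (1.0 - rho) * (1.0 + (n - 1) * rho)
    alpha = (1.0 + (n - 2) * rho) / denom
    beta = rho / denom
    return [[alpha if i == j else -beta for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def max_deviation_from_identity(n, rho):
    """Largest absolute entry of Sigma * Sigma^{-1} - I."""
    prod = matmul(equicorrelation(n, rho), equicorrelation_inverse(n, rho))
    return max(abs(prod[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))


if __name__ == "__main__":
    for n, rho in [(10, 0.1), (15, 0.5), (25, 0.9)]:
        print(n, rho, max_deviation_from_identity(n, rho))
```

The deviations should be at the level of floating-point rounding, which supports the expressions for $\alpha$ and $\beta$ and in particular the negative sign of the off-diagonal elements.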

For the mean vectors chosen above, the quadratic form $\tilde T^2 = (\mu_x - \mu_y)^T \Sigma^{-1} (\mu_x - \mu_y)$ evaluates to

$$\tilde T^2 = (\underbrace{1, \dots, 1}_{m}, \underbrace{0, \dots, 0}_{n-m}) \, \Sigma^{-1} \, (\underbrace{1, \dots, 1}_{m}, \underbrace{0, \dots, 0}_{n-m})^T = m\alpha - (m^2 - m)\beta = \frac{m(1 + (n-2)\rho) - m(m-1)\rho}{(1-\rho)(1 + (n-1)\rho)} = \frac{m(1 + (n-m-1)\rho)}{(1-\rho)(1 + (n-1)\rho)}. \qquad (3)$$

Let us note that it does not matter whether $\mu_y$ consists of ones and zeros or of a constant $a$ and zeros. In the latter case, $\tilde T^2$ would be multiplied by $a^2$.

Now we will work with $\tilde T^2$ and investigate its behavior. If we changed $m$ to $m+1$ (meaning that we add one more different marginal distribution), we would expect the statistic $\tilde T^2$ to increase, and with it the power of Hotelling's test. We need to check whether this is indeed the case. For better understanding let the number of ones in $\mu_y$ be the index of $\tilde T^2$ (we will write it only when it is needed). Now we change $m$ to $m + 1 = h$ and we have $\tilde T^2_{m+1} = \tilde T^2_m + \alpha - 2m\beta$. If $\tilde T^2$ were an increasing function of $m$, then $\alpha - 2m\beta$ should be greater than zero. But we have

$$\alpha - 2m\beta = \frac{1 + (n-2)\rho}{(1-\rho)(1 + (n-1)\rho)} - \frac{2m\rho}{(1-\rho)(1 + (n-1)\rho)} = \frac{1 + (n - 2m - 2)\rho}{(1-\rho)(1 + (n-1)\rho)}.$$

Since the denominator is greater than zero, $\alpha - 2m\beta > 0$ only if $(2h - n)\rho < 1$, that is, for $2h > n$, only if $\frac{1}{2h - n} > \rho$. It means that for not very small values of $\rho$ and for $m > \frac{n}{2}$ the statistic $\tilde T^2$ is a decreasing function of $m$. This means that the maximal power of Hotelling's test (as a function of $m$) is not always attained for $m = n$; for values of $\rho$ which are not very small, the maximal power is attained for $m$ near $\frac{n}{2}$. Some examples of the behavior of $\tilde T^2$ as a function of $m$ are illustrated in figure 1.

However, this issue is not the only surprising one about Hotelling's test. Now we look at whether $\tilde T^2_1$ is always lower than $\tilde T^2_n$. If it is not, one different marginal distribution influences the statistic more than all $n$ different distributions do. So we need to compare $\alpha$ with $n\alpha - n(n-1)\beta$. We have

$$\tilde T^2_1 - \tilde T^2_n = \alpha - n\alpha + n(n-1)\beta = \frac{(n-1)(2\rho - 1)}{(1-\rho)(1 + (n-1)\rho)}.$$

So $\tilde T^2_1 - \tilde T^2_n < 0$ only if $\rho < 0.5$. Therefore we can say that for $\rho > 0.5$ Hotelling's test has better power for the alternative with only one marginal shift than for the alternative in which all marginal distributions are equally shifted. This can be seen from figure 1 as well. Moreover, the statistic $\tilde T^2$ is an increasing function of $\rho$, which may seem surprising as well.
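The behavior of the theoretical statistic as a function of the number of shifted marginals can be explored directly from equation (3). The sketch below is illustrative only (the function names are ours): it evaluates (3), the increment obtained when one more marginal is shifted, and the number of shifted marginals that maximizes the statistic.

```python
def theoretical_t2(m, n, rho):
    """Theoretical statistic of equation (3) for m unit shifts out of n."""
    denom = (1.0 - rho) * (1.0 + (n - 1) * rho)
    return m * (1.0 + (n - m - 1) * rho) / denom


def increment(m, n, rho):
    """Change of the statistic when m grows to m + 1, i.e. alpha - 2*m*beta."""
    denom = (1.0 - rho) * (1.0 + (n - 1) * rho)
    return (1.0 + (n - 2 * m - 2) * rho) / denom


def best_m(n, rho):
    """Number of shifted marginals with the largest theoretical statistic.

    Ties are resolved in favor of the smaller m.
    """
    return max(range(1, n + 1), key=lambda m: theoretical_t2(m, n, rho))


if __name__ == "__main__":
    n = 10
    for rho in (0.3, 0.9):
        values = [round(theoretical_t2(m, n, rho), 3) for m in range(1, n + 1)]
        print("rho =", rho, "best m =", best_m(n, rho))
        print(values)
```

For $n = 10$ and $\rho = 0.9$ the maximum sits near $n/2$ and the value for $m = 1$ exceeds the value for $m = n$, while for $\rho = 0.3$ the ordering of these two values is reversed, in line with the threshold $\rho = 0.5$ derived above.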
4 Hotelling's test for two-dimensional data

Let us look at Hotelling's test in the two-dimensional case. As in the previous case, we will consider the two-sample problem, but now we will generalize the difference of the expected values of these two samples. Suppose that $\mu_x - \mu_y = (a_1, a_2)^T$ and that the covariance matrix is

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}.$$

Figure 1: Plots of the theoretical statistic $\tilde T^2$ for $n = 10, 15, 25, 40$; $\rho = 0.1, 0.3, 0.5, 0.7, 0.9$; and $m = 1, \dots, n$. Notice: each plot is differently scaled!

Then the inverse of $\Sigma$ is the matrix with diagonal elements $\alpha = \frac{1}{(1-\rho)(1+\rho)}$ and off-diagonal elements $-\beta$, where $\beta = \frac{\rho}{(1-\rho)(1+\rho)}$. Then the theoretical statistic is $\tilde T^2 = \alpha a_1^2 + \alpha a_2^2 - 2\beta a_1 a_2$. First we consider $a_1 = 1$ and $a_2 = 0$. Then $\tilde T^2 = \alpha$. Now we will investigate for which $a_1, a_2 \in \mathbb{R}$ we have $\tilde T^2 = \alpha$. That is, we need to solve the equation

$$\alpha a_1^2 + \alpha a_2^2 - 2\beta a_1 a_2 = \alpha. \qquad (4)$$

After dividing both sides of equation (4) by $\alpha$ we get

$$a_1^2 + a_2^2 - 2\rho a_1 a_2 - 1 = 0. \qquad (5)$$

For fixed $a_1$, equation (5) is quadratic in $a_2$ with the roots

$$a_2^{(1,2)} = \frac{2\rho a_1 \pm \sqrt{(2\rho a_1)^2 - 4(a_1^2 - 1)}}{2}.$$
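As a small illustration of equation (5), the sketch below (function names are ours) returns the two roots in $a_2$ for a given $a_1$ and evaluates the two-dimensional quadratic form, so that one can confirm that a shift found this way gives the same theoretical statistic as the shift $(1, 0)$.

```python
import math


def equal_power_a2(a1, rho):
    """Roots a2 of a1^2 + a2^2 - 2*rho*a1*a2 - 1 = 0, or None if not real."""
    disc = (2.0 * rho * a1) ** 2 - 4.0 * (a1 ** 2 - 1.0)
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    return ((2.0 * rho * a1 - root) / 2.0, (2.0 * rho * a1 + root) / 2.0)


def quadratic_form(a1, a2, rho):
    """Theoretical statistic (a1, a2) Sigma^{-1} (a1, a2)^T in two dimensions."""
    return (a1 * a1 + a2 * a2 - 2.0 * rho * a1 * a2) / (1.0 - rho * rho)


if __name__ == "__main__":
    rho = 0.9
    for a1 in (0.5, 1.0, 2.0):
        roots = equal_power_a2(a1, rho)
        print(a1, roots, [quadratic_form(a1, a2, rho) for a2 in roots])
```

For $a_1 = 1$ the two roots are $a_2 = 0$ and $a_2 = 2\rho$, so with $\rho = 0.9$ the shift $(1, 1.8)$ yields the same theoretical statistic as the shift $(1, 0)$.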

The roots are real only if $(2\rho a_1)^2 - 4(a_1^2 - 1) \geq 0$, i.e. for $|a_1| \leq \frac{1}{\sqrt{1 - \rho^2}}$. Some plots of the solutions of equation (5) for different values of the correlation coefficient $\rho$ are given in figure 2. We can see that the plots of these solutions produce elliptic curves. Let us rotate these ellipses by the angle $\varphi = \pi/4$ clockwise. To do this, we use the transformation

$$a_1 = x \cos\varphi - y \sin\varphi = \frac{\sqrt{2}}{2} x - \frac{\sqrt{2}}{2} y,$$
$$a_2 = x \sin\varphi + y \cos\varphi = \frac{\sqrt{2}}{2} x + \frac{\sqrt{2}}{2} y,$$

where $x$ and $y$ are the new rotated coordinates. After substitution into (5) it gives

$$\left(\frac{\sqrt{2}}{2} x - \frac{\sqrt{2}}{2} y\right)^2 + \left(\frac{\sqrt{2}}{2} x + \frac{\sqrt{2}}{2} y\right)^2 - 2\rho \left(\frac{\sqrt{2}}{2} x - \frac{\sqrt{2}}{2} y\right)\left(\frac{\sqrt{2}}{2} x + \frac{\sqrt{2}}{2} y\right) = x^2 (1 - \rho) + y^2 (1 + \rho) = \frac{x^2}{a^2} + \frac{y^2}{b^2} = 1,$$

where $a = \frac{1}{\sqrt{1-\rho}}$ and $b = \frac{1}{\sqrt{1+\rho}}$ are respectively the major radius and the minor radius of the ellipse. Since $a > b$, Hotelling's test has the weakest power in the direction $a_1 = a_2$, while the fastest increase of its power is observed in the direction $a_1 = -a_2$. For example, for $\rho = 0.9$ we have $a \approx 3.162$ and $b \approx 0.725$. It means that for $a_1 = a_2 = \frac{\sqrt{2}}{2} \cdot 3.162 \approx 2.236$ Hotelling's test has approximately the same power as for $a_1 = 1$, $a_2 = 0$ (or as for $a_1 = -a_2 = \frac{\sqrt{2}}{2} \cdot 0.725 \approx 0.513$). So, if there is only one marginal distribution shifted by one unit, then the power of Hotelling's test is approximately the same as if both marginal distributions were equally shifted (in the same direction) by 2.236 units (for a shift in opposite directions it should be only 0.513 units). These results are in contrast with other multidimensional tests. For example, consider the test based on marginal t-statistics. The power of this test is higher if both distributions are shifted by the same amount (both t-statistics are large, regardless of the direction of the shift) than if only one marginal distribution is shifted (one t-statistic is near zero).

5 Theory and reality

The analytical results obtained above should be verified by checking whether actual outcomes of Hotelling's test correspond to them. In this section we compare the behavior of the theoretical Hotelling's statistic $\tilde T^2$ with the real Hotelling's statistic $T^2$. For large $n_x$ and $n_y$ we assumed that $\tilde T^2 \approx T^2 / k$, where $k = \frac{n_x n_y}{n_x + n_y}$.
The constant $k$ changes as $n_x$ and $n_y$ change. It is reasonable to divide Hotelling's statistic $T^2$ by $k$, instead of multiplying the theoretical statistic $\tilde T^2$ by $k$, in order to be able to compare how $T^2 / k$ and $\tilde T^2$ differ for various $n_x$ and $n_y$.

In order to compare the actual results with the analytical ones, we did the following simulations. All data were simulated from n-dimensional normal distributions. We consider three different values for the number of genes in a gene set: $n = 10$, $n = 15$ and $n = 25$. All simulations were performed for three different values of the correlation coefficient: $\rho = 0.1$, $\rho = 0.5$ and $\rho = 0.9$. In order to compare the behavior of Hotelling's test for various sample sizes we took three choices of $n_x$ and $n_y$: $n_x = n_y = n$, $n_x = n_y = 1.4n$ and $n_x = n_y = 2.4n$. The value $m$, which is the number of false marginal distributions, varies from one to $n$. The shift value for each of the different marginal distributions is set to one. The theoretical Hotelling's statistic is calculated according to (3). The real Hotelling's statistic is estimated from 1000 simulations for each case (as the mean of $T^2 / k$ obtained from the simulations). Plots of our simulated cases are shown in figure 3.

Figure 2: Plots of solutions of equation (5) for the two-dimensional case for $\rho = 0.25, 0.5, 0.9$. Notice: each plot is differently scaled!

We can see that for all simulated situations the shapes of the real and theoretical Hotelling's statistics are similar. The only difference is in the heights of these curves. For small $n_x$ and $n_y$, $T^2 / k$ has higher values than for large $n_x$ and $n_y$. The reason for that stems from the inaccurate estimates of the expected values and of the covariance matrix. However, we observe that with the increase of $n_x$ and $n_y$, $T^2 / k$ goes to $\tilde T^2$ relatively fast. Therefore, the behavior of Hotelling's test for real data is expected to be very similar to the behavior of $\tilde T^2$.

In the previous section we saw that in the two-dimensional case the plotted shifts with equal values of the theoretical Hotelling's statistic form elliptic curves. Hotelling's statistics $T^2$ are random variables. Therefore, we can only estimate whether their expected values form elliptic curves when plotted. To check this we did the following simulations. Instead of calculating the shifts for which Hotelling's test has equal statistics, we took points provided by the elliptic curves observed for the theoretical Hotelling's statistics. For each of these points $(a_1, a_2)$ we did 1000 simulations and calculated Hotelling's statistic. We estimated the expected value $E\,T^2 / k$ as the mean over these 1000 repetitions. We divided Hotelling's statistics by $k$ to better understand how fast these statistics go to $\tilde T^2$. We did this simulation for the values of the correlation coefficient $\rho = 0.3$ and $\rho = 0.9$, and as the number of observations in each sample we took $n_x = n_y = 5$, $n_x = n_y = 10$ and $n_x = n_y = 20$. The results of our simulation are given in Table 1. We observe that the estimated mean values of $T^2 / k$ are not very different, that they go to $\tilde T^2$ and that their variance decreases with an increasing number of observations.
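The two-dimensional simulation described above can be sketched in a few lines of standard-library Python. This is a rough illustration under our own assumptions, not the code used for the paper: the function names, the ellipse parametrization by an angle `t`, the random seed and the way correlated normals are generated are all ours.

```python
import math
import random


def ellipse_point(t, rho):
    """Shift (a1, a2) on the curve a1^2 + a2^2 - 2*rho*a1*a2 = 1 at angle t."""
    a = 1.0 / math.sqrt(1.0 - rho)   # major radius
    b = 1.0 / math.sqrt(1.0 + rho)   # minor radius
    x, y = a * math.cos(t), b * math.sin(t)
    s = math.sqrt(2.0) / 2.0
    return (s * (x - y), s * (x + y))


def sample_bivariate(mean, rho, size, rng):
    """Draw `size` points from N(mean, [[1, rho], [rho, 1]])."""
    c = math.sqrt(1.0 - rho * rho)
    points = []
    for _ in range(size):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        points.append((mean[0] + z1, mean[1] + rho * z1 + c * z2))
    return points


def t2_over_k(xs, ys):
    """Two-sample Hotelling statistic divided by k = nx*ny/(nx+ny), n = 2."""
    nx, ny = len(xs), len(ys)
    mx = [sum(p[i] for p in xs) / nx for i in (0, 1)]
    my = [sum(p[i] for p in ys) / ny for i in (0, 1)]
    s11 = s12 = s22 = 0.0
    for pts, mu in ((xs, mx), (ys, my)):
        for p in pts:
            d1, d2 = p[0] - mu[0], p[1] - mu[1]
            s11 += d1 * d1
            s12 += d1 * d2
            s22 += d2 * d2
    dof = nx + ny - 2                 # pooled covariance divisor
    s11, s12, s22 = s11 / dof, s12 / dof, s22 / dof
    det = s11 * s22 - s12 * s12
    d1, d2 = mx[0] - my[0], mx[1] - my[1]
    # d^T S^{-1} d with the explicit inverse of a 2x2 matrix
    return (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det


def mean_t2_over_k(shift, rho, n_per_sample, reps=1000, seed=0):
    """Monte Carlo estimate of E[T^2 / k] for a given shift mu_y - mu_x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xs = sample_bivariate((0.0, 0.0), rho, n_per_sample, rng)
        ys = sample_bivariate(shift, rho, n_per_sample, rng)
        total += t2_over_k(xs, ys)
    return total / reps


if __name__ == "__main__":
    rho = 0.3
    for t in (0.0, math.pi / 4, math.pi / 2):
        shift = ellipse_point(t, rho)
        print(shift, [mean_t2_over_k(shift, rho, ns) for ns in (5, 10, 20)])
```

With a setup like this, the estimates for different points of the same ellipse should be close to each other for a fixed sample size and should decrease towards the theoretical value $1/(1-\rho^2)$ as the sample size grows; exact numbers depend on the seed and the number of repetitions.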
Clearly, the simulated shifts $(a_1, a_2)$ with nearly equal mean values of $T^2 / k$ form elliptic curves. Hence, we can claim that the real Hotelling's test behaves very similarly to the theoretical one and that the theory derived for the theoretical test holds for the real Hotelling's test as well.

Figure 3: Comparisons of the theoretical statistics $\tilde T^2$ and the real Hotelling's statistics $T^2 / k$ for the number of genes $n = 10, 15, 25$ (from the top to the bottom); for the correlation coefficient $\rho = 0.1, 0.5, 0.9$ (from the left to the right) and the number of observations in each sample $n_x = n_y = n$ (denoted by "+"), $n_x = n_y = 1.4n$ (denoted by "x") and $n_x = n_y = 2.4n$ (marker symbol missing in the source). The theoretical $\tilde T^2$ is drawn with a further marker (symbol missing in the source). The number of different marginal distributions is set from one to $n$. Notice: each plot is differently scaled!

6 Discussion

In this paper we have discovered that the two-sample Hotelling's test (for testing the equality of the expected values of two samples from multidimensional normal distributions with equal covariance structure) has some unexpected properties. At first sight, one could expect that with a larger number of false marginal distributions the power of this test increases. But we have discovered that this is not true in general. For highly correlated and high-dimensional data (such as data sets of gene expressions) the maximal power of Hotelling's test is reached when only about one half of the marginal distributions are shifted. We have found that when the correlation inside the sample is greater than 0.5, Hotelling's test can have better power if only one marginal distribution is different than if all marginal hypotheses are false. Moreover, the power of Hotelling's test increases for higher correlations. That observation may seem somewhat unexpected as well.

We have investigated Hotelling's test in detail in the two-dimensional case. We have found that the properties of this test are much different from those of the tests based on marginal t-statistics. All reasonable tests based on marginal t-statistics do not depend on the direction of the shift. But the power of Hotelling's test increases very slowly if both marginal distributions are equally shifted and increases much faster if the marginal distributions are shifted in opposite directions. Moreover, alternatives with equal values of the statistic form ellipsoids.

Table 1: Results of simulations of two-dimensional adjusted Hotelling's statistics $T^2 / k$ with $n_s = n_x = n_y$ observations for each sample and correlation coefficient $\rho$. $\tilde T^2$ stands for the theoretical Hotelling's statistic and $(a_1, a_2)$ is the difference between the expected values $\mu_x - \mu_y$ of these samples. The bottom line gives the estimate of the variance of each column. Column layout, for each of $\rho = 0.3$ (where $\tilde T^2 = 1.0989$) and $\rho = 0.9$: $a_1$, $a_2$, $n_s = 5$, $n_s = 10$, $n_s = 20$; bottom row: var.

Acknowledgments

The author thanks Prof. Lev Klebanov, DrSc. for valuable comments, remarks and overall help. The work was supported by the grant SVV 261315/2010.

References

Ackermann, M. and Strimmer, K. (2009), A general modular framework for gene set enrichment analysis, BMC Bioinformatics, 10, 47.

Barry, W. T., Nobel, A. B. and Wright, F. A. (2008), A statistical framework for testing functional categories in microarray data, The Annals of Applied Statistics, 2 No. 1.

Chatfield, C. and Collins, A. J. (1980), Introduction to Multivariate Analysis, Chapman & Hall/CRC.

Glazko, G. and Emmert-Streib, F. (2009), Unite and conquer: univariate and multivariate approaches for finding differentially expressed gene sets, Bioinformatics, 25 No. 18.

Klebanov, L. and Yakovlev, A. (2007), Diverse correlation structures in gene expression data and their utility in improving statistical inference, The Annals of Applied Statistics, 1 No. 2.
