Journal of Multivariate Analysis. Independence tests for continuous random variables based on the longest increasing subsequence


Journal of Multivariate Analysis 127 (2014) 126–146


Independence tests for continuous random variables based on the longest increasing subsequence

Jesús E. García, V. A. González-López

Department of Statistics, University of Campinas, Rua Sérgio Buarque de Holanda, 651, Campinas, São Paulo. CEP , Brazil

Corresponding author. E-mail addresses: jg@ime.unicamp.br (J.E. García), veronica@ime.unicamp.br (V. A. González-López).

Article info

Article history: Received 13 March 2013; available online 3 March 2014.
AMS 2000 subject classifications: 62G10, 62G30.
Keywords: Longest increasing subsequence; Test for independence; Copula.

Abstract

We propose a new class of nonparametric tests for the hypothesis of independence between two continuous random variables X and Y. Given a sample of size n, let π be the permutation which maps the ranks of the X observations to the ranks of the Y observations. We identify the independence assumption of the null hypothesis with the uniform distribution on the permutation space. A test based on the size of the longest increasing subsequence of π ($L_n$) is defined. The exact distribution of $L_n$ is computed from Schensted's theorem (Schensted, 1961). The asymptotic distribution of $L_n$ was obtained by Baik et al. (1999). As the statistic $L_n$ is discrete, only a small set of significance levels is attainable. To solve this problem we define the $JL_n$ statistic, a jackknife version of $L_n$, together with the corresponding hypothesis test. A third test is based on the $JLM_n$ statistic, a jackknife version of the longest monotonic subsequence of π. In a simulation study we apply our tests to diverse dependence situations with null or very small correlations, where the independence hypothesis is difficult to reject. We show that the $L_n$, $JL_n$ and $JLM_n$ tests perform very well in such situations. We illustrate the use of these tests on two real data examples with small sample sizes.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

Let Ω denote the space of univariate continuous cumulative distributions. Let (X, Y) be a random vector with unknown joint cumulative distribution H and univariate marginal distributions F and G respectively, $F \in \Omega$, $G \in \Omega$. Suppose that $(x_1, y_1), \dots, (x_n, y_n)$ is a paired sample of size n of (X, Y). Set

$H_0$: X and Y are independent. (1)

A test is constructed with no extra assumption (other than continuity) about the form of the marginal distributions (a marginal free test). The procedure is based on the size of the longest increasing subsequence of the random permutation defined by the paired sample, denoted by $L_n$. Theorem 3.1 shows how to compute the exact distribution of $L_n$; it is a straightforward application of Schensted's theorem and the theorem of Frame et al., see Schensted [12] and Frame et al. [6]. In addition, we propose two test statistics, denoted briefly by $JL_n$ and $JLM_n$. $JL_n$ is a jackknife version of $L_n$, while $JLM_n$ is based on the size of the longest monotonic subsequence.

The power of these tests is compared by simulation with that of various existing tests. This new class of tests is rank-based; therefore, it is compared with other rank-based procedures for testing independence, namely the nonparametric tests of Kendall, Spearman and Hoeffding and the independence test of Genest et al. [7], denoted here by Genest's test.

We also include the MIC test, based on the maximal information coefficient of Reshef et al. [11]. In addition we include Pearson's test for its well-known performance in the normal case. Kendall's, Spearman's, Hoeffding's and Pearson's tests each estimate the association between X and Y and test whether that association is zero. They use different measures of association, all of them in the interval [−1, 1], with 0 indicating no association/correlation. The asymptotic Genest's test consists of computing the approximate p-value of the test statistic with respect to the empirical distribution obtained by simulation. For the MIC test, the p-value of a given MIC score is computed by selecting a probability δ of false rejection, creating a set of $1/\delta - 1$ surrogate datasets, and comparing the MIC of the real data with the MIC scores of the surrogate datasets.

To compute the p-values for the Kendall, Spearman and Pearson methods, we use the cor.test function, available in the stats package of the R project. Details about each test may be found in Hollander et al. [9]. For Hoeffding's test, we compute the p-values with the hoeffd function, available in the Hmisc package of the R project. For Genest's test we use the indepTest function, available in the copula package of the R project. For the MIC test we used the support program given in .

We performed a simulation study under different conditions. For example, we use a mixture of two bivariate Normal distributions, with correlations ρ and −ρ respectively (zero expected correlation). In this case $L_n$, $JL_n$ and $JLM_n$ were competitive with each other and markedly more powerful than the other six tests considered. Fig. 1, on the left, shows a scatter plot for a sample (size = 200) of this mixture when ρ = 0.9, and Fig. 1, on the right, shows the sample size versus the empirical power (level 0.01). The other tests do not detect the dependence for any sample size. This situation illustrates the usefulness of our proposal; we explore more situations of this kind in Section 4.2.

Fig. 1. Left: scatter plot of a sample (size = 200) from a mixture of two bivariate Normal distributions, with correlations 0.9 and −0.9 respectively (distribution D1 from Section 4). Right: sample size vs. the empirical power (level 0.01) for the same distribution.

We applied the tests based on the longest increasing subsequence to two real datasets, both with small sample sizes, considering that for bigger sample sizes there exist very efficient procedures designed for asymptotic situations. The first dataset was provided by Professor Dalia Chakrabarty, researcher in the School of Physics and Astronomy, University of Nottingham. It consists of two measures, the projected radius and the radial velocity, for 30 Globular Clusters around the galaxy NGC 3379 (see Chakrabarty [5]). The second dataset, named coalminers, appears in VGAM (a package of the R project). The data concern coal miners who are smokers without radiological pneumoconiosis, classified by age, breathlessness and wheeze.

We adapted and implemented (in the C language) the algorithm provided by Zoghbi et al. [13]. We use that algorithm to compute the exact probabilities of $L_n$ in the case $n \le 100$. For n > 100 the asymptotic distribution of $L_n$, obtained by Baik et al. [3], can be used, and we show how to use it in our test in Section 3. Nevertheless, the exact probabilities could also be calculated for n > 100. The probabilities for $JL_n$ and $JLM_n$ were estimated by simulation. The tests and simulations were implemented in the R project environment (LIStest package).

Section 2 provides the main concepts and the definition of the test statistic. In Section 3 we calculate the distribution of the proposed test statistic. Theorem 3.1 gives the exact distribution of the test statistic under the independence assumption, by a direct application of results from Schensted [12] and Frame et al. [6]. Section 4 is devoted to showing the capacity of each test statistic introduced here to detect dependence. Through simulations, we examine each of the test statistics against several dependence situations. We apply the tests to real datasets in Section 4.3. Appendix A contains the proof of Theorem 3.1.

Table 1. Paired sample of size 5 (columns $x_i$, $y_i$).

Fig. 2. Dispersion plot and permutation (Example 2.2). (a) Dispersion plot for the sample; (b) the permutation defined by the sample, where the solid line shows the longest increasing subsequence; (c) the empirical copula of the sample.

2. Preliminaries

We introduce some basic concepts related to the size of the longest increasing subsequence associated with a paired sample of size n of (X, Y) with continuous marginal distributions.

Definition 2.1. Let $S_n$ denote the group of permutations of {1, ..., n}. If $\pi \in S_n$, we say that $\pi(i_1), \dots, \pi(i_k)$ is an increasing subsequence of π if $1 \le i_1 < \dots < i_k \le n$ and $1 \le \pi(i_1) < \pi(i_2) < \dots < \pi(i_k) \le n$.

Definition 2.2. Given a permutation $\pi \in S_n$, we call $l_n(\pi)$ (or $ld_n(\pi)$) the length of the longest increasing (or decreasing) subsequence of π.

Example 2.1. Consider the set {1, 2, 3, 4, 5, 6, 7, 8}. Let π be the permutation which transforms the previous set into the sequence {3, 6, 1, 7, 4, 2, 5, 8}, that is, π(1) = 3, π(2) = 6, π(3) = 1, π(4) = 7, π(5) = 4, π(6) = 2, π(7) = 5, π(8) = 8. Examples of increasing subsequences are {1, 7, 8}, {3, 6, 7, 8}, {1, 2, 5, 8}. The maximal size of the increasing subsequences is 4, which is reached, for example, by the sequences {1, 2, 5, 8}, {1, 4, 5, 8} and {3, 6, 7, 8}; hence $l_8(\pi) = 4$.

We bring the concept of the longest increasing subsequence to the sample space using the next example, in which we connect the sample with a specific permutation π of n points.

Example 2.2. Let us consider the paired sample $\{(x_i, y_i)\}_{i=1}^n$ (from Table 1). First, sort the sample in increasing order of the marginal sample $\{x_i\}_{i=1}^n$ and replace each $x_i$ value with its rank in the sequence; this produces {(1, 3.5), (2, 2.86), (3, 4.17), (4, 3.18), (5, 3.2)}. Next, replace each $y_i$ with its rank in the $\{y_i\}_{i=1}^n$ sequence; this produces {(1, 4), (2, 1), (3, 5), (4, 2), (5, 3)}. The permutation π related to this sample is defined by π(1) = 4, π(2) = 1, π(3) = 5, π(4) = 2, π(5) = 3. The longest increasing subsequence is {1, 2, 3} and $l_5(\pi) = 3$, see Fig. 2(b).

We now define the length of the longest increasing subsequence as a random variable.

Definition 2.3. Let $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$ be replications of (X, Y) with continuous marginal distributions. We denote by $L_n$ the random variable $L_n = l_n(\pi_D)$, where $D = \{(X_i, Y_i)\}_{i=1}^n$ and $\pi_D$ is the permutation which assigns $\pi_D(\mathrm{rank}(X_i)) = \mathrm{rank}(Y_i)$, i = 1, ..., n.

In the next section we show the distribution of $L_n$ under the assumption of independence between X and Y.
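The construction in Examples 2.1 and 2.2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation (which is the R package LIStest): the function names `lis_length` and `sample_permutation` are hypothetical, the longest increasing subsequence is computed with the standard patience-sorting technique, and the x-values below are made up because Table 1 is not reproduced here (only their order matters).

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []  # tails[k]: smallest last value of an increasing subsequence of length k + 1
    for value in seq:
        pos = bisect_left(tails, value)
        if pos == len(tails):
            tails.append(value)   # value extends the longest subsequence found so far
        else:
            tails[pos] = value    # value gives a smaller tail for length pos + 1
    return len(tails)


def sample_permutation(xs, ys):
    """Return [pi(1), ..., pi(n)] with pi(rank(x_i)) = rank(y_i).

    Assumes no ties, which holds almost surely for continuous marginals.
    """
    order_by_x = sorted(range(len(xs)), key=lambda i: xs[i])
    order_by_y = sorted(range(len(ys)), key=lambda i: ys[i])
    y_rank = {i: r for r, i in enumerate(order_by_y, start=1)}
    return [y_rank[i] for i in order_by_x]


# Example 2.1: the permutation written as the sequence pi(1), ..., pi(8)
print(lis_length([3, 6, 1, 7, 4, 2, 5, 8]))

# Example 2.2: hypothetical x-values (already increasing), y-values from the text
xs = [0.1, 0.2, 0.3, 0.4, 0.5]
ys = [3.5, 2.86, 4.17, 3.18, 3.2]
pi = sample_permutation(xs, ys)
print(pi, lis_length(pi))
```

The observed value of $L_n$ for a sample is then `lis_length(sample_permutation(xs, ys))`.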

3. The distribution of $L_n$

The exact distribution of $L_n$ in the case of independence can be obtained using the next theorem, in which the probability that $L_n$ equals k, for k = 1, ..., n, is denoted by $p^n_k$.

Theorem 3.1. Let (X, Y) be a random vector with continuous marginal distributions, under Hypothesis (1). Suppose that $(x_1, y_1), \dots, (x_n, y_n)$ is a paired sample of size n of (X, Y). Let $S_n$ denote the group of permutations of {1, ..., n} and let $S_{n,U}$ be $S_n$ with uniform distribution U. Then, for k = 1, 2, ..., n, if $p^n_k = \mathrm{Prob}(L_n = k)$,

$$p^n_k = \frac{1}{n!} \sum_{m=1}^{n} \sum_{W \in V_n(k,m)} N(W)^2, \qquad (2)$$

where $L_n$ is given by Definition 2.3, $V_n(k, m)$ is the set of shapes of standard Young tableaux of order n having k columns and m rows, and N(W) is the number of standard Young tableaux with shape W, as given by Formula (6).

Proof. See Appendix A.

Remark 1. $S_{n,U}$ is the space of permutations $\pi_D$ where $D = \{(X_i, Y_i)\}_{i=1}^n$ and the $(X_i, Y_i)$ are i.i.d. with the same law as (X, Y) under Hypothesis (1). For k = 1, 2, ..., n, $p^n_k = \#\{\pi \in S_n : l_n(\pi) = k\} / n!$.

There are diverse algorithms in the literature to find $V_n(k, m)$; we implemented the ZS2 algorithm of Zoghbi et al. [13]. Using Theorem 3.1 we compute $p^n_k$ for $1 \le k \le n$, $n \le 100$. The table can be accessed from the LIStest package, implemented in the R project.

The asymptotic distribution of $L_n$ in the case of independence, after appropriate centering and scaling, was first obtained by Baik et al. [3]. Let q(z) denote the solution of the Painlevé II equation

$$q''(z) = 2q^3(z) + z\,q(z),$$

satisfying the boundary condition $q(z) \sim \mathrm{Ai}(z)$ as $z \to \infty$, where Ai is the Airy function and $q''$ denotes the second derivative of q. Hastings et al. [8] show the asymptotic solutions

$$q(z) = \mathrm{Ai}(z) + O\!\left(\frac{e^{-(4/3) z^{3/2}}}{z^{1/4}}\right) \text{ as } z \to \infty, \qquad q(z) = \sqrt{\frac{-z}{2}} \left(1 + O\!\left(\frac{1}{z^2}\right)\right) \text{ as } z \to -\infty.$$

The Tracy–Widom distribution is defined by the following cumulative distribution:

$$F_{TW}(t) = \exp\!\left(-\int_t^{\infty} (z - t)\, q^2(z)\, dz\right), \quad t \in \mathbb{R}. \qquad (3)$$

Theorem 3.2. Under the assumptions of Theorem 3.1, if χ is a random variable whose distribution function is $F_{TW}$, given by Eq. (3), then

$$\chi_n = \frac{L_n - 2\sqrt{n}}{n^{1/6}} \to \chi \text{ in distribution, as } n \to \infty. \qquad (4)$$

Proof. See Baik et al. [3].

For n > 100 we use the asymptotic distribution of $L_n$ through Eq. (4). We calculate the asymptotic p-values using the R package RMTstat, specifying the parameter β = 2 in the cumulative function ptw.

3.1. The $L_n$ independence test

Let $(x_1, y_1), \dots, (x_n, y_n)$ be a paired sample of size n of (X, Y) with continuous marginal distributions. The p-value for a statistical test with null hypothesis of independence against the alternative hypothesis of non-independence between X and Y is defined in the following way.

Definition 3.1. The two-sided p-value is

$$\min\left\{ 2F_{L_n}(l_0)\, I_{\{F_{L_n}(l_0) \le 1/2\}} + 2\left(1 - F_{L_n}(l_0)\right) I_{\{F_{L_n}(l_0) > 1/2\}},\ 1 \right\},$$

where $l_0$ is the observed value of $L_n$ in the sample, $F_{L_n}$ is the cumulative distribution function, $F_{L_n}(l_0) = \sum_{k=1}^{l_0} p^n_k$ (see Eq. (2)), and $I_E$ denotes the indicator function of the set E.
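As a rough illustration of Eq. (2) and Definition 3.1, the following Python sketch enumerates the shapes with k columns as integer partitions of n whose largest part is k, and counts standard Young tableaux with the hook length formula. It assumes that Formula (6) of the Appendix is that formula. The function names are hypothetical, and the naive recursive enumeration is practical only for small n; the paper instead relies on the ZS2 algorithm implemented in C for $n \le 100$.

```python
from fractions import Fraction
from math import factorial


def partitions(n, max_part=None):
    """Yield the partitions of n as non-increasing tuples (row lengths of a shape)."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for first in range(min(n, max_part), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


def num_standard_tableaux(shape):
    """Hook length formula: n! divided by the product of the hook lengths of all cells."""
    n = sum(shape)
    col_heights = [sum(1 for row in shape if row > j) for j in range(shape[0])]
    hook_product = 1
    for i, row in enumerate(shape):
        for j in range(row):
            arm = row - j - 1             # cells to the right in the same row
            leg = col_heights[j] - i - 1  # cells below in the same column
            hook_product *= arm + leg + 1
    return factorial(n) // hook_product


def lis_distribution(n):
    """Return [p_1, ..., p_n], p_k = Prob(L_n = k) under independence, as exact fractions."""
    counts = [0] * (n + 1)
    for shape in partitions(n):
        k = shape[0]  # length of the first row = number of columns
        counts[k] += num_standard_tableaux(shape) ** 2
    total = factorial(n)
    return [Fraction(counts[k], total) for k in range(1, n + 1)]


def two_sided_p_value(l0, probs):
    """Two-sided p-value of Definition 3.1 for an observed value l0 of L_n."""
    cdf = sum(probs[:l0])  # F(l0) = p_1 + ... + p_{l0}
    value = 2 * cdf if cdf <= Fraction(1, 2) else 2 * (1 - cdf)
    return min(value, Fraction(1))


dist = lis_distribution(4)
print(dist)
print(two_sided_p_value(2, dist))
```

For n = 4 the counts of permutations with longest increasing subsequence 1, 2, 3, 4 are 1, 13, 9, 1, which sum to 4! = 24 and agree with the counting formula of Remark 1.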

3.2. The $JL_n$ statistic

The $JL_n$ statistic is obtained from two modifications of the $L_n$ statistic. The first modification is based on Johansson [10]. That paper shows that, in the independent case, if we consider $U_i = \mathrm{rank}(X_i)$ and $V_i = \mathrm{rank}(Y_i)$ for i = 1, ..., n, then the typical deviation of a maximal path from the diagonal U = V is of order $n^{5/6}$. Accordingly, we only consider points whose ranks are at a distance less than or equal to $c\,n^{5/6}$ from the diagonal U = V, i.e. $|U_i - V_i| \le c\,n^{5/6}$, where c is a constant. To choose the value of c, we tried different values on simulated data and used the best one. We started with c = 0.1, then c = 0.2, which gives better power, and so on until the power started to go down, which happened for c = 0.5. Table 15 shows the results of the simulation study used to choose c. Note that the power of the test does not change much for values of c between 0.3 and 0.5. We choose c = 0.4, as it seems to give the best power for the distributions and sample sizes used in the simulation. Formally, we introduce the set

$$D^{diag} = \left\{ (U_i, V_i),\ i = 1, \dots, n : |U_i - V_i| \le 0.4\, n^{5/6} \right\},$$

and we define $L_n^{diag} = l_{n_{diag}}(\pi_{D^{diag}})$, with $n_{diag} = \# D^{diag}$.

The second modification is a jackknife procedure. The statistic $L_n$ is discrete; Fig. 3(a) shows $F_{L_{80}}$, the cumulative distribution function of $L_n$ for n = 80. For example, $F_{L_{80}}(\cdot)$ jumps from for x = 11 to for x = 12; then, if we want to test at level 0.05 against a unilateral alternative, we will reject independence when a unilateral p-value is and the exact level 0.05 cannot be achieved. To mitigate this characteristic we define a jackknife version of the $L_n^{diag}$ statistic.

Definition 3.2. Let $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$ be replications of (X, Y) with continuous marginal distributions. We define

$$JL_n = \frac{1}{n_{diag}} \sum_{(u,v) \in D^{diag}} L_n^{diag}(u, v),$$

where $L_n^{diag}(u, v) = l_{n_{diag} - 1}(\pi_{D^{(u,v)}})$ with $D^{(u,v)} = D^{diag} \setminus \{(u, v)\}$, for each $(u, v) \in D^{diag}$.

Fig. 3(b) shows $F_{JL_{80}}$, the cumulative distribution function of $JL_n$ for n = 80. We can see that the number of steps in the function has grown. In this case, $F_{JL_{80}}(\cdot)$ jumps from for x = to for x = ; this means that if we want to test at level 0.05, in practice we will test at level .

Remark 2. If we reexamine the cumulative distribution of the statistic $L_{80}$ in Fig. 3 (left), we can see that (under Hypothesis (1)) the set of values of $L_{80}$ with probabilities significantly different from zero is {11, 12, ..., 17, 18, 19}, as can be seen in the following table (columns $l$ and $p^{80}_l$). In the same way for $JL_n$, under Hypothesis (1), the set of values with probabilities significantly different from zero lies inside the interval (9, 22), as can be seen from the right picture in Fig. 3. Because of this, if some underlying dependence structure between the random variables increases or decreases the size of the longest increasing subsequence, even by a small quantity (4 or 5 for n = 80), it can easily take the $L_n$ or $JL_n$ statistic to regions of very low probability. Another useful characteristic of the $L_n$ (and $JL_n$) statistic is that, as seen in Aldous et al. [1], under Hypothesis (1),

$$\lim_{n \to \infty} \frac{E(L_n)}{\sqrt{n}} = 2;$$

in other words, under the assumption of independence, when n grows $L_n$ grows like $2\sqrt{n}$.

3.3. The $JLM_n$ statistic

The idea behind the $JLM_n$ statistic is to use the size of the longest monotonic subsequence, which is the maximum of the sizes of the longest increasing subsequence and the longest decreasing subsequence. The size of the longest decreasing subsequence of a sample $\{(x_i, y_i)\}_{i=1}^n$ is the size of the longest increasing subsequence of the sample $\{(x_i, -y_i)\}_{i=1}^n$. As before, we only consider points that are at a distance smaller than $0.4\, n^{5/6}$ from the corresponding diagonal, derived from the observations transformed through the ranks, as described in Section 3.2.

Definition 3.3. Let $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$ be replications of (X, Y) with continuous marginal distributions. We define

$$JLM_n = \max\{JL_n, JL_n^{-}\},$$

with $JL_n$ and $JL_n^{-}$ given by Definition 3.2 applied to $(X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)$ and to $(X_1, -Y_1), (X_2, -Y_2), \dots, (X_n, -Y_n)$, respectively.
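Definitions 3.2 and 3.3 can be read as the following Python sketch. It is an interpretation under stated assumptions rather than the LIStest code: the function names are hypothetical, ties are assumed absent, the leave-one-out lengths are recomputed from scratch (quadratic in the number of retained points), and the value returned when no point falls in the diagonal band is a convention chosen here, since the text does not cover that case.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence (patience sorting)."""
    tails = []
    for value in seq:
        pos = bisect_left(tails, value)
        if pos == len(tails):
            tails.append(value)
        else:
            tails[pos] = value
    return len(tails)


def ranks(values):
    """Ranks 1..n of the values, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for r, i in enumerate(order, start=1):
        result[i] = r
    return result


def jl_statistic(xs, ys, c=0.4):
    """Sketch of JL_n: jackknife average of LIS lengths over the diagonal band."""
    n = len(xs)
    band = c * n ** (5 / 6)
    # D^diag: rank pairs with |U - V| <= c * n^(5/6), ordered by U
    near = sorted((u, v) for u, v in zip(ranks(xs), ranks(ys)) if abs(u - v) <= band)
    if not near:
        return 0.0  # assumption: convention for an empty band, not specified in the text
    vs = [v for _, v in near]
    # leave each retained point out in turn and average the LIS lengths
    total = sum(lis_length(vs[:k] + vs[k + 1:]) for k in range(len(vs)))
    return total / len(vs)


def jlm_statistic(xs, ys, c=0.4):
    """Sketch of JLM_n: maximum of JL_n on (x, y) and on (x, -y)."""
    return max(jl_statistic(xs, ys, c), jl_statistic(xs, [-y for y in ys], c))


# A perfectly increasing sample of size 10: every leave-one-out LIS has length 9
xs = [float(i) for i in range(1, 11)]
print(jl_statistic(xs, xs), jlm_statistic(xs, xs))
```

Because the statistic is an average of integers, it takes many more distinct values than $L_n$, which is what allows significance levels closer to the nominal one. Its null distribution still has to be estimated by simulation, as the paper does.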

Fig. 3. (a) Cumulative distribution function of $L_n$ for n = 80 (left). (b) Cumulative distribution function of $JL_n$ for n = 80 (right).

4. Simulations

To compare the power of our tests against Pearson's test, Kendall's test, Spearman's test, Hoeffding's test, Genest's test and the MIC test, we carried out a simulation study in which, for each test, we estimate the power function for different sample sizes and diverse joint distributions. For each joint distribution and each sample size 20, 40, 60, 80, 100 we simulated 5000 samples and computed the p-values. The simulation was implemented in the R project. Denote by $(X_i^j, Y_i^j)_{i=1}^n$ the j-th simulated sample, with j = 1, ..., 5000 and n = 20, 40, 60, 80, 100. Given a level α, we calculate the empirical significance level as

$$\frac{\#\left\{ j : \text{p-value}\left((X_i^j, Y_i^j)_{i=1}^n\right) \le \alpha \right\}}{5000}, \qquad (5)$$

where $\text{p-value}\left((X_i^j, Y_i^j)_{i=1}^n\right)$ denotes the p-value associated with sample j, $(X_i^j, Y_i^j)_{i=1}^n$.

The p-values for the $L_n$, $JL_n$ and $JLM_n$ statistics were calculated using our R package LIStest. For $n \le 100$ the LIStest package uses the exact probabilities for $L_n$, computed using Theorem 3.1. For n > 100 it uses the Tracy–Widom approximation given by Theorem 3.2. In the LIStest package, the distributions of $JL_n$ and $JLM_n$ were estimated by simulation for $n \le 200$.

We divided our simulations into two parts: the independence case, to compare (5) for different sample sizes, and the dependent case, to measure and compare the power of the tests in diverse situations. Complete tables with the empirical power obtained in each situation can be consulted in the Appendix of this paper (see Appendix B).

4.1. Independence

Considering that the tests (except Pearson's test) are marginal free, we analyze only two distributions for the case of independence. The first distribution consists of pairs of independent random variables with standard Normal marginal distributions. The second distribution, with heavier tails, consists of pairs of independent random variables with Pareto distribution of parameter 4. Fig. 4 and Tables 7 and 8 show the behavior of the empirical significance levels for these two distributions. We note that the empirical level of Pearson's test can reach values significantly higher than α = 0.01 under the effect of the marginal distributions, for example when the marginal distributions are heavy-tailed, as with the Pareto distribution. For illustration, see the lower panel in Fig. 4, which shows the effect of the marginal distribution on Pearson's test. Situations such as these show the importance of using a marginal free methodology to detect dependence. We can also see that the empirical significance levels of the $JL_n$ and $JLM_n$ tests are much closer to the theoretical α than those of the $L_n$ test.

Fig. 4. Left: scatter plot of a sample of size 200. Right: sample size vs. the empirical significance level (α = 0.01) for the same distribution. Top: independent N(0, 1) random variables. Bottom: independent Pareto(4) random variables.

4.2. Dependence

4.2.1. The conjecture

No dependence test is known to be optimal under all dependence structures. With this in mind, our conjecture is that the proposed family of tests is efficient at detecting types of dependence with null correlation or when the correlation takes very small values, and we concentrate our study on joint distributions with zero or very small correlations that challenge many tests of independence.

In order to give some intuition about the behavior of the new tests in traditional situations, we first explore four models with medium and high correlations: (i) Gumbel's copula with parameter θ, where the cumulative distribution is given by

$$C_G(x, y \mid \theta) = \exp\left\{ -\left[ (-\ln x)^{\theta} + (-\ln y)^{\theta} \right]^{1/\theta} \right\}, \quad \theta \in [1, \infty);$$

(ii) Frank's copula with parameter θ and cumulative distribution

$$C_F(x, y \mid \theta) = -\frac{1}{\theta} \ln\left\{ 1 + \frac{(\exp(-\theta x) - 1)(\exp(-\theta y) - 1)}{\exp(-\theta) - 1} \right\}, \quad \theta \in (-\infty, \infty) \setminus \{0\};$$

(iii) Clayton's copula with parameter θ and cumulative distribution

$$C_C(x, y \mid \theta) = \max\left\{ \left( x^{-\theta} + y^{-\theta} - 1 \right)^{-1/\theta}, 0 \right\}, \quad \theta \in [-1, \infty) \setminus \{0\};$$

and (iv) the bivariate normal distribution with correlation ρ. Tables 2 and 3 show the results. In all the simulated cases we fixed the parameters θ and ρ in order to obtain a correlation between x and y approximately equal to 0.5 and 0.7. For large sample sizes, at the nominal level 0.01, the new family of tests detects the dependence but with lower power than the other tests, except in some situations, such as the case of correlation approximately equal to 0.7, where the new family of tests showed positive results. In general, Pearson's test shows the highest power in all the distributions with moderate to large correlation considered in this study, see Tables 2 and 3. To illustrate in detail the behavior that we observe in the cases included in Tables 2 and 3, see Fig. 5.
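The three copula distribution functions above translate directly into code. The Python sketch below is for illustration only: the authors simulated in R, the function names are hypothetical, and the Clayton sampler uses the standard conditional-inversion technique for θ > 0, which is an assumption of this sketch and not a method described in the text.

```python
import math
import random


def gumbel_cdf(x, y, theta):
    """Gumbel copula C_G(x, y | theta), theta >= 1, for 0 < x, y < 1."""
    s = (-math.log(x)) ** theta + (-math.log(y)) ** theta
    return math.exp(-s ** (1.0 / theta))


def frank_cdf(x, y, theta):
    """Frank copula C_F(x, y | theta), theta != 0."""
    num = (math.exp(-theta * x) - 1.0) * (math.exp(-theta * y) - 1.0)
    return -math.log(1.0 + num / (math.exp(-theta) - 1.0)) / theta


def clayton_cdf(x, y, theta):
    """Clayton copula C_C(x, y | theta), theta >= -1, theta != 0, for 0 < x, y <= 1."""
    base = x ** (-theta) + y ** (-theta) - 1.0
    if base <= 0.0:  # can only happen for negative theta; the copula is 0 there
        return 0.0
    return base ** (-1.0 / theta)


def sample_clayton(n, theta, rng):
    """Draw n pairs (u, v) from a Clayton copula with theta > 0 by conditional inversion."""
    pairs = []
    for _ in range(n):
        u = 1.0 - rng.random()  # uniform on (0, 1]
        w = 1.0 - rng.random()  # uniform on (0, 1]
        # solve dC/du (u, v) = w for v
        v = (u ** (-theta) * (w ** (-theta / (1.0 + theta)) - 1.0) + 1.0) ** (-1.0 / theta)
        pairs.append((u, v))
    return pairs


rng = random.Random(2014)
print(sample_clayton(3, 10.0, rng))
```

Two quick sanity checks follow from the formulas: with θ = 1 the Gumbel copula reduces to the independence copula xy, and every copula satisfies C(x, 1) = x.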
We also observed that the statistics introduced in this paper are not consistent against all alternatives. For example, they are not able to distinguish between the Uniform distribution on $[0, 1]^2$ and the distribution given by $f(x, y) = 1$ if $(x, y) \in [0, 1]^2 \setminus (A \cup B)$, $f(x, y) = 2$ if $(x, y) \in B$ and $f(x, y) = 0$ otherwise, where A = [0.5 a, a] × [1 − a, 1] and B = [0.5 a, a] × [0, a], with a very small value a. That is, the distribution f is equal to the Uniform except on the sets A and B. On A, f is zero and on B, f takes the value 2 (because the mass on A was reallocated to B). The family of tests given in this paper will recognize the magnitude of the density distributed on the diagonals (x = y and/or x = −y) and in their neighborhood.

Table 2. Empirical significance level (α = 0.01). In each case the best power is shown in bold. Columns: Dist., n, Spe, Ken, Pea, Hoe, Mine, Gen, $L_n$, $JL_n$, $JLM_n$. Rows: Gumbel, Frank and Clayton copulas, each for two values of θ.

Table 3. Empirical significance level (α = 0.01), bivariate Normal distribution with variance 1 and correlation ρ. In each case the best power is shown in bold. Columns: ρ, n, Spe, Ken, Pea, Hoe, Mine, Gen, $L_n$, $JL_n$, $JLM_n$.

Fig. 5. Left: scatter plot of a sample of size 200 from the Gumbel copula. Right: sample size vs. the empirical significance level (α = 0.01) for the same distribution. Top: θ = 1.55 (correlation 0.5). Bottom: θ = 2.07 (correlation 0.7).

4.2.2. Settings of dependence

We consider two main situations in which the samples show low correlation: (a) visible dependence; (b) hidden dependence. In the first group we explore distributions with the following x–y plot shapes: (i) a cross, (ii) a ring, and (iii) a square. All of them are types of dependence with null expected correlation coefficients. We note that Pearson's test, Kendall's test and Spearman's test are not consistent for Hypothesis (1), which explains their poor performance in almost all the cases presented in this section.

For case (a) we implemented the following joint distributions:

D1 Mixture of two bivariate Normal distributions with variances 1 and correlations ρ and −ρ; $(X, Y) \sim \frac{1}{2} N_2(0, \Sigma_1) + \frac{1}{2} N_2(0, \Sigma_2)$, where 0 = (0, 0), $\Sigma_1 = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$ and $\Sigma_2 = \begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}$.

D2 Uniform ring centered at 0 with internal radius ρ and external radius 1.

D3 Uniform distribution on $\{[-1, 1] \times [-1, 1]\} \setminus \{[-\rho, \rho] \times [-\rho, \rho]\}$ (border of a square).

In all the cases (see Figs. 1, 6–9 and Tables 9–11 respectively), the new family of tests attains the highest empirical powers. Situations D1, D2 and D3 show how it is possible to enhance the efficiency of statistical tests based on $L_n$, exemplified here by the statistics $JL_n$ and $JLM_n$.

For case (b) we implemented the following joint distributions:

D4 Mixture of two bivariate Normal distributions, one independent with standard deviation 4 and the other dependent with standard deviation 1 and correlation ρ; $(X, Y) \sim \frac{1}{2} N_2(0, 16I) + \frac{1}{2} N_2(0, \Sigma)$, where 0 = (0, 0), $16I = \begin{pmatrix} 16 & 0 \\ 0 & 16 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$.

D5 Mixture of two bivariate Normal distributions, γ% independent with standard deviation 4 and (1 − γ)% dependent with standard deviation 0.5 and correlation ρ = 0.95; $(X, Y) \sim \gamma N_2(0, 16I) + (1 - \gamma) N_2(0, \Sigma)$, where 0 = (0, 0), $16I = \begin{pmatrix} 16 & 0 \\ 0 & 16 \end{pmatrix}$ and $\Sigma = \begin{pmatrix} 0.25 & 0.25\rho \\ 0.25\rho & 0.25 \end{pmatrix}$.

D6 Mixture of two bivariate Clayton copulas, one with parameter −0.1 and the other with parameter 10; $(X, Y) \sim 0.75\, C_C(\cdot, \cdot \mid -0.1) + 0.25\, C_C(\cdot, \cdot \mid 10)$.

For distribution D6, Spearman's correlation was about . Spearman's, Kendall's and Pearson's tests cannot detect the dependence, because the proportion of the sample (25%) with strong correlation (around 0.94) is too small compared with the proportion (75%) with small negative correlation (about −0.17). According to the results shown by Figs and Tables respectively, in scenarios D4, D5 and D6 the statistics $L_n$, $JL_n$ and $JLM_n$ reach the best results, followed by Hoeffding's test.

Fig. 6. Left: scatter plot of a sample of size 200 from D1. Right: sample size vs. the empirical significance level (α = 0.01) for the same distribution with ρ = 0.7.

4.3. Applications

4.3.1. Application 1, Globular Clusters data

The dataset is composed of a sample of globular clusters (GC) around the galaxy NGC 3379 (see Bergond et al. [4] and Chakrabarty [5]). NGC 3379 is the brightest elliptical galaxy in the constellation Leo and it is known to have a supermassive black hole. The measures (Fig. 13) are the Projected Distance to the galaxy, expressed in kpc (x axis), and the Line of Sight (LOS) radial Velocity, expressed in km/s (y axis), for 30 GCs. While conceptually the dependence between the Projected Distance and the LOS Velocity exists, it is not detected by Pearson's test, Kendall's test, Spearman's test, Hoeffding's test, Genest's test or the MIC test, see the results in Table 4.

Table 4. P-values, Globular Clusters, NGC 3379 galaxy. Columns: Spe, Ken, Hoe, Pea, Gen, MIC, $L_n$, $JL_n$, $JLM_n$.

Astronomers use this kind of relation to infer the total mass distribution in galaxies. They can compare, for example, the globular cluster system of NGC 3379 (with a black hole) and some planetary nebulae (without a black hole) in order to infer the influence produced by the presence of a black hole. The $L_n$ test (with a p-value = ), the $JL_n$ test (with a p-value = ) and the $JLM_n$ test (with a p-value = ) are able to show that dependence. Spearman's correlation between the Projected Distance and the LOS Velocity is equal to

11 136 J.E. García, V. A. González-López / Journal of Multivariate Analysis 127 (2014) Fig. 7. The left panel is the scatter plot of a sample size 200 of D2. The right panel is the sample size vs. the empirical significance level (α = 0.01) for the same distribution. On top ρ = 0, and on bottom ρ = Application 2, coalminers data The dataset named coalminers, appears on VGAM (package from R-project). The data is about coal-miners who are smokers without radiological pneumoconiosis, classified by age, breathlessness and wheeze. Denote by BW the counts with breathlessness and wheeze, BnW the counts with breathlessness but no wheeze, nbw the counts with no breathlessness and wheeze. Fig. 14, on the left, shows the plot between BnW and BW, while Fig. 15, on the left, shows the plot between BW and nbw. Each point, was took according to 9 age-groups. In both situations the dependency appears as a consequence of event B (breathlessness) or W (wheeze) respectively. Since Figs. 14 and 15 (left) expose an increasing tendency, we can test some unilateral hypotheses for the relation BnW versus BW and BW versus nbw, respectively. For the tests based on a specific measure, we test measure > 0 ; for the L n test we test L n > M 0, where M 0 is the mode of the distribution under the independence assumption. For that kind of hypotheses, we compute the exact p-values for Pearson s test, Spearman s test, Kendall s test (see the description of the function cor.test from R-project) and the L n test. For the first case (BnW versus BW) all the tests reject the independence in favor of an increasing tendency, with small p-values (lesser that 1e-05) and with dependence coefficients taking values approximately equal to 1 (Pearson s correlation,

12 J.E. García, V. A. González-López / Journal of Multivariate Analysis 127 (2014) Fig. 8. The left panel is the scatter plot of a sample size 200 of D2. The right panel is the sample size vs. the empirical significance level (α = 0.01) for the same distribution. On top ρ = 0.3, and on bottom ρ = 0.5. Table 5 Unilateral hypotheses (BW, nbw). Test Spe Ken Pea L n JL n Coefficient p-value Spearman s rank correlation and Kendall s rank correlation). In contrast, for the second situation (BW versus nbw) the tests based on correlations fail to reject the null hypothesis, while the L n test and JL n test reject it, at level 5%. We show in Table 5 the results in which case, the L n test and JL n test show the best performance. 5. Conclusions In this work we develop a new class of nonparametric independence tests for the independence of two continuous random variables. For the L n test we show the exact distribution for the test statistic, for the JL n and JLM n tests we estimated

the distribution by simulation. We compare our family of tests with Pearson's test, Kendall's test, Spearman's test, Genest's test, Hoeffding's test and MIC's test using simulations. We apply the L_n test and its variants to two real data examples. The inability of L_n to reach the nominal α level (by construction) is overcome by its variants JL_n and JLM_n. For the sample sizes considered in our study, the tests based on the longest increasing subsequence were the only ones capable of detecting dependence for the distributions D1–D6, followed by Hoeffding's test in the cases D4 and D6. It is necessary to emphasize the ability of these new tests to address situations with moderate (even small) sample sizes. In all the cases in which the L_n test and its variants work well, they have the highest power for sample sizes larger than 40. This property, together with the capacity to control the significance level (through the JL_n and/or JLM_n versions), puts this procedure in an advantageous position relative to the other tests, as shown in the simulation study and in the applications to real data. According to the simulation study (Section 4), the L_n, JL_n and JLM_n tests behave remarkably well in the mixture cases, in which the samples are composed of two subsamples, coming from a strongly correlated distribution and from a weakly correlated distribution, respectively. In summary, in this paper we wish to draw attention to the potential of statistics constructed from the longest increasing subsequence. Such statistics can be useful for detecting types of dependence that are difficult to identify with the tests available in the literature.

Fig. 9. The left panel is the scatter plot of a sample of size 200 from D3. The right panel shows the sample size vs. the empirical significance level (α = 0.01) for the same distribution. On top ρ = 0.5, and on bottom ρ = 0.7.

Fig. 10. The left panel is the scatter plot of a sample of size 200 from D4. The right panel shows the sample size vs. the empirical significance level (α = 0.01) for the same distribution. On top ρ = 0.9, and on bottom ρ = […].

Acknowledgments

The authors gratefully acknowledge the support for this research provided by (a) USP project Mathematics, computation, language and the brain, (b) Portuguese in time and space: linguistic contact, grammars in competition and parametric change, FAPESP's project, grant 2012/ and (c) FAPESP Center for Neuromathematics (grant 2013/ , S. Paulo Research Foundation). Special thanks to Professor Dalia Chakrabarty for making the astronomical data used in this paper available to us. We wish to thank the referees and an associate editor for their many helpful comments and suggestions on an earlier draft of this paper.

Appendix A. Proof of Theorem 3.1

The combinatorial concepts that we will introduce, useful in representation theory, have been extensively developed in Frame et al. [6], Schensted [12] and Baer et al. [2].

Fig. 11. The left panel is the scatter plot of a sample of size 200 from D5. The right panel shows the sample size vs. the empirical significance level (α = 0.01) for the same distribution. On top γ = 0.75, and on bottom γ = […].

Definition A.1. A standard Young Tableau of order n is an arrangement of n distinct natural numbers in rows and columns so that the numbers in each row and in each column form increasing sequences, so that there is an element of each row in the first column and an element of each column in the first row, and so that there are no gaps between numbers.

There is a 1–1 correspondence between permutations and pairs of standard Young Tableaux of the same shape. To each permutation we can assign a corresponding standard Young Tableau, citing Baer et al. [2], in the following way. Let the permutation be {x_1, x_2, ..., x_n}. For the moment, define the first entry in the first row of the tableau to be x_1. Now, if at the i-th step the first i entries of the sequence have been used in the developing tableau, then at the next step the element x_{i+1} is inserted into the first row of the tableau by displacing the smallest entry in the first row which is larger than x_{i+1}, or by appending x_{i+1} at the end of the first row if it is larger than all entries in the first row. If an entry y is displaced from the first row by x_{i+1}, then y is inserted into the second row by letting it displace the smallest entry in the second row which is larger than y, or by simply appending y to the second row if there is no such element. The process is continued from row to row until either the original x_{i+1} or a displaced element is appended to the end of a row. Then the whole process is renewed for x_{i+2}, ... until all of the entries of the original permutation sequence have been entered into the tableau.
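The insertion procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name schensted_tableau is hypothetical, and the input is assumed to be a sequence of distinct numbers.

```python
from bisect import bisect_right


def schensted_tableau(perm):
    """Build a tableau by row insertion (hypothetical helper, distinct entries assumed)."""
    rows = []
    for x in perm:
        for row in rows:
            # Index of the smallest entry of this row that is larger than x.
            i = bisect_right(row, x)
            if i == len(row):
                # x is larger than every entry: append it and stop.
                row.append(x)
                break
            # Otherwise x displaces that entry, which moves on to the next row.
            row[i], x = x, row[i]
        else:
            # No row could take the (possibly displaced) element: open a new row.
            rows.append([x])
    return rows


if __name__ == "__main__":
    print(schensted_tableau([3, 6, 1, 5, 7, 2, 4, 8]))
```

For the permutation {3, 6, 1, 5, 7, 2, 4, 8} the sketch returns the rows [1, 2, 4, 8], [3, 5, 7] and [6].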

Fig. 12. The left panel is the scatter plot of a sample of size 200 from D6. The right panel shows the sample size vs. the empirical significance level (α = 0.01) for the same distribution.

Fig. 13. The plots show the Projected Distance (kpc) vs. the Line of Sight Velocity (on the left) and the ranks of the Projected Distance vs. the ranks of the Line of Sight Velocity (on the right) for 30 GCs associated with NGC […].

Example A.1. Applying the algorithm given by Baer et al. [2], and reproduced in the previous paragraph, to the permutation {3, 6, 1, 5, 7, 2, 4, 8} we find the following sequence of arrangements (rows of each arrangement are separated by /). The last arrangement (step 8) is the standard Young Tableau.

step 1: 3
step 2: 3 6
step 3: 1 6 / 3
step 4: 1 5 / 3 6
step 5: 1 5 7 / 3 6
step 6: 1 2 7 / 3 5 / 6
step 7: 1 2 4 / 3 5 7 / 6
step 8: 1 2 4 8 / 3 5 7 / 6

Remark 3. The size of the first row of the standard Young Tableau is the size of the longest increasing subsequence of the permutation.
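Since only the first row matters for the size of the longest increasing subsequence, the rest of the tableau can be skipped. The Python sketch below keeps just that row. The names first_row, lis_length and lis_statistic are hypothetical. The last function assumes a sample without ties and orders the Y observations by X, which is one way to realize the rank permutation π described in the abstract.

```python
from bisect import bisect_left


def first_row(perm):
    """First row of the insertion tableau of a sequence of distinct numbers."""
    row = []
    for x in perm:
        # With distinct entries this is the smallest entry larger than x.
        i = bisect_left(row, x)
        if i == len(row):
            row.append(x)
        else:
            row[i] = x  # the displaced entry is simply discarded here
    return row


def lis_length(perm):
    """Size of the longest increasing subsequence."""
    return len(first_row(perm))


def lis_statistic(x, y):
    """L_n for paired samples x, y (assumption: no ties in x or in y)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    return lis_length([y[i] for i in order])


if __name__ == "__main__":
    print(first_row([3, 6, 1, 5, 7, 2, 4, 8]))
    print(lis_length([3, 6, 1, 5, 7, 2, 4, 8]))
```

For the permutation of Example A.1 the first row is [1, 2, 4, 8], so the longest increasing subsequence has size 4.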

Fig. 14. The plots show counts with breathlessness and no wheeze (BnW) vs. counts with breathlessness and wheeze (BW) (on the left) and the ranks of BnW vs. the ranks of BW (on the right).

Fig. 15. The plots show counts with breathlessness and wheeze (BW) vs. counts with no breathlessness and wheeze (nbw) (on the left) and the ranks of BW vs. the ranks of nbw (on the right).

Definition A.2. If T is a standard Young Tableau of order n, for each element j, j ∈ {1, ..., n}, of the arrangement we define the hook number of j, h_j, as the number of elements in the same column and in the same row in which j is included, counting from the bottom of the column up to the element j and from the right end of the row up to the element j.

Example A.2. We illustrate the concepts introduced by means of Example 2.2. [Left: standard Young Tableau. Right: hook numbers.]

Remark 4. By definition, the h_j numbers depend on the shape of the tableau, not on the numbers filling it. Each permutation is directly associated with the shape of a standard Young Tableau, but different permutations of {1, ..., n} can give the same tableau shape. The next example shows all the possible shapes of standard Young Tableaux that can be obtained from the permutations of 5 numbers.

Example A.3. Consider the set {1, 2, 3, 4, 5}. Each shape in the list (Table 6) is associated with an integer partition, which is a way of writing n as a sum of positive integers, denoted by IP(n) (n = 5 in this case).
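The hook numbers of Definition A.2 and the list of shapes of Example A.3 can be sketched in Python as follows. The function names are hypothetical, and a shape is encoded here as a tuple of row lengths in non-increasing order. This encoding is an assumption of the sketch: the text describes partitions through column sizes, which the helper conjugate recovers.

```python
def conjugate(shape):
    """Column sizes of a shape given by its row lengths."""
    return tuple(sum(1 for r in shape if r > j) for j in range(shape[0]))


def hook_numbers(shape):
    """Hook number of every cell: cells to the right, cells below, plus the cell itself."""
    cols = conjugate(shape)
    return [[(r - j - 1) + (cols[j] - i - 1) + 1 for j in range(r)]
            for i, r in enumerate(shape)]


def partitions(n, largest=None):
    """Yield the integer partitions of n as non-increasing tuples."""
    largest = n if largest is None else largest
    if n == 0:
        yield ()
        return
    for first in range(min(n, largest), 0, -1):
        for rest in partitions(n - first, first):
            yield (first,) + rest


if __name__ == "__main__":
    print(hook_numbers((3, 2)))
    for shape in partitions(5):
        print(shape, "columns:", conjugate(shape))
```

For n = 5 the generator yields seven shapes. For the shape with rows of sizes 3 and 2, the hook numbers are 4, 3, 1 in the first row and 2, 1 in the second.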

Table 6. List of shapes (of standard Young Tableaux) and hook numbers for the permutations of {1, 2, 3, 4, 5}. Columns: Shape 1 to Shape 7, with integer partitions IP1(5) to IP7(5).

Table 7. Empirical power at level α = 0.01 for independent random variables with Normal marginal distributions. Columns: n, Spe, Ken, Pea, Hoe, MIC, Gen, L_n, JL_n, JLM_n.

Table 8. Empirical power at level α = 0.01 for independent random variables with Pareto(4) marginal distributions. Columns: n, Spe, Ken, Pea, Hoe, MIC, Gen, L_n, JL_n, JLM_n.

Table 9. Empirical power at level α = 0.01 for distribution D1. In each case the best power is highlighted in bold. Columns: ρ, n, Spe, Ken, Pea, Hoe, MIC, Gen, L_n, JL_n, JLM_n.

Shape 1 corresponds to the permutation π(1) = 5, π(2) = 4, π(3) = 3, π(4) = 2, π(5) = 1 (ld_5 = 5) and it is associated with the integer partition of n = 5 given by IP1(5) = 5 (the number of elements in the first column of shape 1). Shape 5 is associated with the integer partition of n = 5, IP5(5) = […], where each term of IP5(5) (from left to right) is the number of elements in the corresponding column of shape 5. Given a permutation π, the size of the longest increasing subsequence of π is the size of the first row in the shape of the tableau corresponding to the permutation. The next results allow us to compute the number of permutations of n numbers such that l_n(π) = k from the number of standard Young tableaux whose shape has a first row of size k.

Theorem A.1 (Frame et al. [6]). Given a shape W, the number of standard Young tableaux with shape W, containing the integers {1, ..., n}, is

$$N(W) = \frac{n!}{\prod_{j=1}^{n} h_j}, \qquad (6)$$

where the h_j, j = 1, ..., n, are the hook numbers for each cell of the tableau.
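Formula (6) can be checked numerically with the short Python sketch below. The names num_tableaux, lis_counts and SHAPES_5 are hypothetical, and shapes are written as row lengths (an assumption of this sketch). The second function groups shapes by the size of their first row and adds N(W)^2 per shape, which is the count that Schensted's result (Theorem A.2 in this appendix) attaches to each shape.

```python
from math import factorial, prod

# The seven shapes of order 5, written as row lengths (assumed encoding).
SHAPES_5 = [(5,), (4, 1), (3, 2), (3, 1, 1), (2, 2, 1), (2, 1, 1, 1), (1, 1, 1, 1, 1)]


def num_tableaux(shape):
    """N(W) = n! divided by the product of the hook numbers of the shape."""
    n = sum(shape)
    cols = [sum(1 for r in shape if r > j) for j in range(shape[0])]
    hooks = [r - j + cols[j] - i - 1
             for i, r in enumerate(shape) for j in range(r)]
    return factorial(n) // prod(hooks)


def lis_counts(shapes):
    """Number of permutations per size of the longest increasing subsequence."""
    counts = {}
    for w in shapes:
        k = w[0]  # size of the first row
        counts[k] = counts.get(k, 0) + num_tableaux(w) ** 2
    return counts


if __name__ == "__main__":
    print(num_tableaux((3, 2)), num_tableaux((3, 1, 1)))
    print(lis_counts(SHAPES_5))
```

The two shapes with a first row of size 3 give N(W) = 5 and N(W) = 6, so 25 + 36 = 61 permutations of 5 elements have a longest increasing subsequence of size 3, and the counts over all sizes add up to 5! = 120.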

Table 10. Empirical power at level α = 0.01 for distribution D2. In each case the best power is highlighted in bold. Columns: ρ, n, Spe, Ken, Pea, Hoe, MIC, Gen, L_n, JL_n, JLM_n.

Table 11. Empirical power at level α = 0.01 for distribution D3. In each case the best power is highlighted in bold. Columns: ρ, n, Spe, Ken, Pea, Hoe, MIC, Gen, L_n, JL_n, JLM_n.

Table 12. Empirical power at level α = 0.01 for distribution D4. In each case the best power is highlighted in bold. Columns: ρ, n, Spe, Ken, Pea, Hoe, MIC, Gen, L_n, JL_n, JLM_n.

Example A.4. The number of standard Young tableaux containing the numbers {1, 2, 3, 4, 5} with the shape given by Example A.2 (left) is 5!/[4·3·2] = 5 (using the values of Example A.2 (right)).

Theorem A.2 (Schensted [12]). Let V_n(k, m) be the set of shapes of Young tableaux of order n having k columns and m rows. The number of permutations of n elements with a longest increasing subsequence of size k and a longest decreasing subsequence of size m is

$$\sum_{W \in V_n(k,m)} N(W)^2.$$

Example A.5. Considering the set of numbers {1, 2, 3, 4, 5}, we want to calculate the number of sequences having l_5 = 3. Let us denote by #{A} the cardinality of the set A; #{l_5 = 3} = #{l_5 = 3, ld_5 = 2} + #{l_5 = 3, ld_5 = 3}, corresponding with

only two possible shapes of Young tableaux, shape 4 and shape 5 (see Table 6). Using Theorem A.2, #{l_5 = 3, ld_5 = 2} = 5^2 = 25, #{l_5 = 3, ld_5 = 3} = 6^2 = 36 and #{l_5 = 3} = 25 + 36 = 61.

Table 13. Empirical power at level α = 0.01 for distribution D5. In each case the best power is highlighted in bold. Columns: ρ, n, Spe, Ken, Pea, Hoe, MIC, Gen, L_n, JL_n, JLM_n.

Table 14. Empirical power at level α = 0.01 for distribution D6. In each case the best power is highlighted in bold. Columns: n, Spe, Ken, Pea, Hoe, MIC, Gen, L_n, JL_n, JLM_n.

Table 15. Empirical power of the test JL_n at level α = 0.01, for c = 0.1, 0.2, 0.3, 0.4 and 0.5, for six dependence situations. Rows: Distribution (five D distributions and the Bivariate Normal, each with its value of ρ); columns: c.

Proof. Let (X_1, Y_1), ..., (X_n, Y_n) be independent, identically distributed bivariate random vectors with the same distribution as (X, Y). X and Y satisfy Hypothesis (1), with continuous marginal distributions. We can define L_n as
