THE PAIR CHART. Dana Quade. University of North Carolina. Institute of Statistics Mimeo Series No. 537. July 1967


THE PAIR CHART
by Dana Quade
University of North Carolina
Institute of Statistics Mimeo Series No. 537
July 1967
Supported by U. S. Public Health Service Grant No. 3-T1-ES-61-01.
DEPARTMENT OF BIOSTATISTICS, UNIVERSITY OF NORTH CAROLINA, Chapel Hill, N. C.

The Pair Chart¹

by Dana Quade, University of North Carolina

Let $X_1, X_2, \ldots, X_m$ be a random sample of $m$ observations on a variable $X$ with unknown distribution function $F$, and let $Y_1, Y_2, \ldots, Y_n$ be a random sample of $n$ observations on a variable $Y$ with unknown distribution function $G$. The classic "two sample problem" is then to compare such samples, and in particular to test the null hypothesis $H_0$ that $F = G$. My purpose in this paper is to show how a certain diagram, which I call a pair chart, may give insight into the problem and may also shorten the computations.

The pair chart is constructed as follows. First delineate, on a sheet of ordinary graph paper, a rectangle $m$ units wide and $n$ units high. Starting at the lower left corner of this rectangle, which may be designated as the point (0,0), draw a line one unit to the right if the smallest observation in the two samples combined is an X, and one unit up if it is a Y. Then, from the end of this first line, draw a second line, one unit to the right if the second smallest observation in the two samples combined is an X, and one unit up if it is a Y. Continue in the same manner; except that, on coming to a between-sample tie involving, say, $m'$ X's and $n'$ Y's, draw a sloping line from the end of the last line drawn to the point $m'$ units to the right and $n'$ units above it. (Within-sample ties present no difficulties to the construction.) When lines have been drawn corresponding to all the $(m + n)$ observations, the upper right hand corner $(m,n)$ of the original rectangle will have been reached.

¹ Supported by U. S. Public Health Service grant no. 3-T1-ES to the University of North Carolina Institute for Environmental Health Studies.

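For example, suppose the two samples are:

x: 2, 4, 4, 4, 6, 6, 7, 9   (m = 8)
y: 4, 7, 8, 8, 8, 11        (n = 6)

Then the combined sample may be arranged as follows:

value:   2   4       6     7     8      9   11
sample:  X   (XXXY)  (XX)  (XY)  (YYY)  X   Y

and the corresponding pair chart is as shown in Figure 1.

[Figure 1. Pair chart for the example: a path from (0,0) to (8,6), with the X observations marked along the horizontal axis and the Y observations along the vertical axis.]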
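The construction rule can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the original paper: the function name `pair_chart_path` is hypothetical, and the path is represented simply as a list of vertices. Each distinct value in the combined sample contributes one segment, which is horizontal if only X's take that value, vertical if only Y's do, and sloping for a between-sample tie.

```python
from collections import Counter


def pair_chart_path(xs, ys):
    """Return the vertices of the pair-chart path from (0, 0) to (m, n)."""
    x_counts, y_counts = Counter(xs), Counter(ys)
    path = [(0, 0)]
    x, y = 0, 0
    for value in sorted(set(xs) | set(ys)):
        # Move right by the number of X's equal to this value and up by
        # the number of Y's; both moves together give a sloping segment.
        x += x_counts[value]
        y += y_counts[value]
        path.append((x, y))
    return path


xs = [2, 4, 4, 4, 6, 6, 7, 9]
ys = [4, 7, 8, 8, 8, 11]
print(pair_chart_path(xs, ys))
# [(0, 0), (1, 0), (4, 1), (6, 1), (7, 2), (7, 5), (8, 5), (8, 6)]
```

The two sloping segments, from (1,0) to (4,1) and from (6,1) to (7,2), correspond to the between-sample ties at the values 4 and 7.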

The explanation of the name "pair chart" is this: each unit square within the rectangle represents one of the $mn$ possible pairs $(X_i, Y_j)$ which include one observation from each sample. Thus, the shaded square in Figure 1 corresponds to the pair $X_3 = 4$, $Y_5 = 8$. If the chart includes sloping lines, corresponding to between-sample ties, consider forming boxes such that the sloping lines are their diagonals. In Figure 1 there are two such boxes, outlined with dots. Then if $X_i = Y_j$ for any particular pair, the corresponding square is inside such a box; if $X_i > Y_j$ the corresponding square lies in the region at the lower right portion of the rectangle; and if $X_i < Y_j$ the square lies in the remaining region at the upper left. The areas of the three regions (the one consisting of all the boxes, the lower right one, and the upper left one) are equal, respectively, to the numbers of pairs of the three types: those where $X_i = Y_j$, $X_i > Y_j$, or $X_i < Y_j$. Thus counting the squares in the three regions of Figure 1 shows that there are 4 pairs $(X_i, Y_j)$ such that $X_i = Y_j$, 8 pairs such that $X_i > Y_j$, and 36 such that $X_i < Y_j$, out of the total of $mn = 48$ in all.

The pair chart seems to have been invented by Drion [1], who used it only as a tool for obtaining certain theoretical results concerning the two-sample Kolmogorov-Smirnov test (discussed below). Although he was already aware of its interpretation in terms of pairs, Drion did not develop his diagram further, and its potentialities have been overlooked ever since.

Consider the locus of points $(x,y) = (F(z), G(z))$ for $-\infty < z < \infty$. This is in general a nondecreasing curve which joins the points (0,0) and (1,1), except that there may be gaps in it, corresponding to discontinuities in $F$

or $G$: let any such gaps be bridged by straight lines. What results may be called the relative distribution function (RDF) of $Y$ with respect to $X$; the RDF of $X$ with respect to $Y$ can be obtained from it by reflection in the line $x = y$.

Now let $F_m$ and $G_n$ be the empirical distribution functions of the two samples: that is,

$$F_m(z) = \frac{\text{number of sample X's} \le z}{m} \quad\text{and}\quad G_n(z) = \frac{\text{number of sample Y's} \le z}{n}.$$

Then the empirical relative distribution function of $Y$ with respect to $X$ is the locus of points $(x,y) = (F_m(z), G_n(z))$, with gaps bridged as for the true RDF; again, reflection in the line $x = y$ gives the empirical RDF of $X$ with respect to $Y$. The empirical RDF is essentially equivalent to the diagram which Wilk and Gnanadesikan [7] have called a P-P plot. It is the sample estimator of the true RDF, which it will approach in the limit with increasing $m$ and $n$. It may also be thought of as a standardized version of the pair chart, from which it differs only by a change of scale, since the path of lines from (0,0) to $(m,n)$ in the pair chart is based on the locus of points $(x,y) = (mF_m(z), nG_n(z))$. The pair chart has the advantage in simplicity of construction, however, especially when $m$ and $n$ are unequal.

If (and only if) $F = G$, then the RDF of $Y$ with respect to $X$, or $X$ with respect to $Y$, is the diagonal of the unit square with corners (0,0), (0,1), (1,0), (1,1). Various manners of departure from this situation can be given clear interpretations. For example, the RDF of $Y$ with respect to

$X$ lies generally above the diagonal if the X's tend to be larger than the Y's. And the RDF crosses the diagonal, lying first below and then above it, if the X's tend to be more dispersed than the Y's. Thus the form of an RDF may give an extremely useful picture of the comparison between the two populations. This is illustrated in Figures 2 and 3. Figure 2 exemplifies the effects on the RDF of differences in mean and variance between $X$ and $Y$, by showing the RDF's of some nonstandard normal variables with respect to a standard normal variable. Figure 3 exemplifies the effects of differences in distributional form, by showing the RDF's of various nonnormal variables with respect to a normal variable with the same mean and variance. It may be noted that some of the RDF's in Figure 3 are almost indistinguishable from the diagonal of the unit square. This suggests that, with a rank-order method, it will be difficult to distinguish a normal distribution from, say, a logistic or triangular if the mean and variance are the same.

The empirical RDF or, equivalently, the pair chart yields similar interpretations with respect to the samples. Thus the path in Figure 1 lies entirely below the diagonal of the rectangle, and this illustrates how the observed values of $X$ are in general smaller than those of $Y$. In addition, various tests of the hypothesis that $F = G$ may be obtained by agreeing to reject if the empirical RDF, or pair chart, lies too far, in some appropriate sense, from the diagonal.

The occurrence of ties, or at least of ties between samples, is troublesome to nearly all nonparametric procedures in that it complicates calculation of exact P-values. The following general rule for use in dealing with such ties can be stated, however: the correct P-value must be no less

[Figure 2. Relative distribution functions of some nonstandard normal variables with respect to a standard normal variable ($\mu_x = 0$, $\sigma_x = 1$). The panels are labeled by $\sigma_y$ (1, 2, 4) and $\mu_y$ (0, 1, 2).]

[Figure 3. Relative distribution functions of some nonnormal variables with respect to a normal variable having the same mean and variance. Panels: Logistic, Laplace, Cauchy (normal variable has same quartiles rather than same variance), Triangular, Chi-square (3 d.f.), Rectangular, Chi-square (2 d.f.), V-shaped, Chi-square (1 d.f.).]

than what is obtained if all ties are broken in whatever manner least favors the hypothesis, but it must be no more than what is obtained when the ties are broken in the most favorable manner for the hypothesis. Several tests are presented below and are considered in the light of this rule.

The Wilcoxon and Mann-Whitney Procedures. Let $U_x$ be the area on the X-side of the path in the pair chart, below the path, so that $U_x$ is equal to the number of pairs of observations $(X_i, Y_j)$ such that $X_i > Y_j$, plus one half the number of pairs (if any) such that $X_i = Y_j$. Similarly, let $U_y$ be the opposite area on the Y-side, so that $U_y = mn - U_x$. Then on dividing $U_x$ and $U_y$ by $mn$, the total number of pairs, one obtains estimates of the probabilities

$$P_x = P\{X > Y\} + \tfrac{1}{2} P\{X = Y\} \quad\text{and}\quad P_y = P\{X < Y\} + \tfrac{1}{2} P\{X = Y\}.$$

These estimates, which are unbiased and consistent, are the areas within the unit square below and above the empirical RDF of $Y$ with respect to $X$, while $P_x$ and $P_y$ are the corresponding areas for the true RDF.

The Mann-Whitney two-sample test rejects the null hypothesis $H_0$, in favor of the alternative that $P_x \ne P_y$, if an extreme value of $U_x$ occurs. Assuming that $F \equiv G$ is true, and in addition that $F$ and $G$ are continuous, so that ties are impossible, $U_x$ has mean $mn/2$ and variance $mn(m + n + 1)/12$. Its small-sample distribution has been tabulated: see [5], [4], or [3]. Its large-sample distribution is normal, so the test is practicable in that case also; it is then customary to make a continuity correction by shifting the observed value of $U_x$ half a unit toward its mean.
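As a rough illustration of these definitions, the following Python sketch obtains $U_x$ by counting pairs directly rather than by measuring area on the chart, and then applies the normal approximation with the half-unit continuity correction. The function names are illustrative and do not come from the paper. The approximation is applied to the small example samples (x: 2, 4, 4, 4, 6, 6, 7, 9; y: 4, 7, 8, 8, 8, 11) purely for demonstration; with samples this small an exact table would normally be preferred.

```python
import math


def mann_whitney_u(xs, ys):
    """Return (U_x, U_y), counting each between-sample tie as half a pair."""
    greater = sum(1 for x in xs for y in ys if x > y)
    equal = sum(1 for x in xs for y in ys if x == y)
    u_x = greater + 0.5 * equal
    return u_x, len(xs) * len(ys) - u_x


def normal_approx_two_sided_p(u_x, m, n):
    """Two-sided P-value from the large-sample normal approximation."""
    mean = m * n / 2
    variance = m * n * (m + n + 1) / 12
    # Continuity correction: shift the observed value half a unit toward
    # its mean, without overshooting the mean.
    if u_x < mean:
        shifted = min(u_x + 0.5, mean)
    else:
        shifted = max(u_x - 0.5, mean)
    z = (shifted - mean) / math.sqrt(variance)
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


xs = [2, 4, 4, 4, 6, 6, 7, 9]
ys = [4, 7, 8, 8, 8, 11]
u_x, u_y = mann_whitney_u(xs, ys)
print(u_x, u_y)  # 10.0 38.0
print(normal_approx_two_sided_p(u_x, len(xs), len(ys)))
```

Counting pairs costs on the order of $mn$ comparisons, which is harmless at this scale; it is used here only because it mirrors the definition of $U_x$ most directly.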

When between-sample ties occur, the $U_x$ obtained from the pair chart is the same as if half the ties had been broken in one way and half in the other. Hence the corresponding P-value is a compromise between the two extremes of the general rule given above; in practice it may be either too large or too small but is usually very nearly correct.

The area under the path in Figure 1 is $U_x = 10$, and a table of the exact distribution gives the corresponding P-value, for the two-sided test as described above, as .081. By the general rule, a lower bound is .043 (for $U_x = 8$) and an upper bound is .142 (for $U_x = 12$). A one-sided test could also have been performed, if the alternative $P_x < P_y$ had been specified in advance; it would give P-values half as great as the two-sided test.

Another method of computation, due to Wilcoxon, requires ranking the observations of the two samples together. If the ranking is done from smallest first, with average ranks used for ties, then (as is now well-known) we have

$$\sum (\text{ranks of X's in combined sample}) = U_x + m(m+1)/2,$$
$$\sum (\text{ranks of Y's in combined sample}) = U_y + n(n+1)/2.$$

In the example the X's have ranks 1, 3.5, 3.5, 3.5, 6.5, 6.5, 8.5, 13 in the combined sample, with sum 46; and this is equal to $U_x = 10$ plus $m(m+1)/2 = 36$. The test can then be performed either using tables based directly on the rank sum or after converting to a $U$. The pair-chart method is at least as convenient computationally, however, unless the samples are large.

The Kolmogorov-Smirnov Procedures. The standard two-sample Kolmogorov-Smirnov test rejects $H_0$: $F = G$ if too large a value of $D = \max(D^-, D^+)$

is observed, where

$$D^- = \sup_{-\infty < z < \infty} \{F_m(z) - G_n(z)\}, \qquad D^+ = \sup_{-\infty < z < \infty} \{G_n(z) - F_m(z)\};$$

the test is consistent against every alternative $F \ne G$. One-sided tests can be based on $D^-$ and $D^+$. Small-sample tabulations of the null-hypothesis distribution, assuming no between-sample ties, are available: see [5], [4], and [2]. In large samples,

$$P\{D^{\pm} \ge c\} \approx \exp(-2Nc^2)$$

for every $c > 0$, where $N = mn/(m+n)$.

The computational method usually proposed for these tests requires actually drawing the two empirical distribution functions $F_m$ and $G_n$ and then searching for the greatest vertical separation between them. This may, of course, involve considerable labor. A more convenient computational method is as follows: let $(x^-, y^-)$ be the point on the path in the pair chart farthest below the diagonal (if two or more points are equally far below it, any one of them may be used), and let $(x^+, y^+)$ be the point farthest above the diagonal; then

$$D^- = \left(\frac{x^-}{m} - \frac{y^-}{n}\right) \quad\text{and}\quad D^+ = \left(\frac{y^+}{n} - \frac{x^+}{m}\right).$$

The key to understanding this method lies in seeing that vertical distances between two distribution functions are the same as vertical (or horizontal)

distances from their relative distribution function to the diagonal of the unit square.

To illustrate, note that the point on the path farthest below the diagonal in Figure 1 is $(x^-, y^-) = (6,1)$, and hence $D^- = (6/8 - 1/6) = 7/12$. The path is nowhere above the diagonal, so $(x^+, y^+)$ may be taken as (0,0) or (8,6), yielding $D^+ = 0$ either way. The P-value corresponding to $D = \max(D^-, D^+) = 7/12$ is .139, according to the table of Massey [2]; corresponding to $D^- = 7/12$ and $D^+ = 0$ the P-values are .070 and 1.000, respectively.

Because of the labor required by the usual computational methods, it is tempting to perform a preliminary grouping of the data; for instance, Siegel [5] does so in both of the examples which he presents. But grouping will tend to increase the number of between-sample ties, and thus to reduce the power of the test. This is because the Kolmogorov-Smirnov tests are conservative when such ties occur; they use the smallest value of $D$ which could possibly result from any breaking of the ties. Thus, in Figure 1, if the crucial tie at $X_7 = Y_2 = 7$ were broken by making $X_7 > Y_2$, which would be favorable to the hypothesis, then the value of $D$ would remain the same as found previously, with the corresponding upper bound .139 on the true P-value; but breaking the tie in the opposite way would yield $(x^-, y^-) = (7,1)$, $D = 17/24$, and hence the lower bound .043 on the P-value. The pair-chart method, however, yields such a dramatic reduction in computational effort that grouping is no longer necessary.

Runs. Assume that all between-sample ties have been broken, one way or another. Then the pair chart will exhibit no sloping lines, but only horizontal and vertical ones, each corresponding to a "run" of observations

from the same sample. Let $R$ be the total number of these runs. Wald and Wolfowitz [6] showed that under $H_0$: $F \equiv G$ the expected value of $R$ is $(2mn + m + n)/(m + n)$ and that this is the maximum expected value for any $F$ and $G$, under certain restrictions; hence they proposed rejecting $H_0$ for small observed values of $R$. Some insight into their results may be provided by contemplating the pair chart, as follows: if $F = G$ then the pair-chart curve should be "close" to the diagonal of the rectangle, and in order to lie as close as possible it must be made up of many short segments which continually cross and recross the diagonal; if it were made up of fewer and longer segments then it would have to lie farther away on the whole. It is not suggested that the pair chart simplifies computations for this test, however, nor indeed is the runs test to be recommended for practical use, since it is in general less powerful than competitors such as the Kolmogorov-Smirnov test. For further discussion and tables see the books of Siegel [5] and Owen [4] (note that Siegel's table is valid for the .025 level rather than .05 as stated).

In Figure 1 there are 6 or 8 runs, depending on how the crucial tie at $X_7 = Y_2 = 7$ is broken; even 6, however, is not nearly small enough for significance at any usual level.
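The Kolmogorov-Smirnov and runs computations for the example of Figure 1 can be checked with a short Python sketch. It is illustrative only: the function names are hypothetical, the pair-chart path is supplied as a hand-entered list of vertices, and the tie-broken orderings are written out as strings of X's and Y's. Because the distance from the path to the diagonal changes linearly along each straight segment, it is enough to examine the vertices.

```python
from fractions import Fraction


def ks_from_path(path):
    """Return (D-, D+) from pair-chart vertices; the last vertex is (m, n)."""
    m, n = path[-1]
    # Amount by which each vertex lies below the diagonal, on the
    # unit-square scale: x/m - y/n.
    below = [Fraction(x, m) - Fraction(y, n) for x, y in path]
    return max(below), max(-b for b in below)


def count_runs(labels):
    """Count runs in a sequence of 'X'/'Y' labels (ties already broken)."""
    return 1 + sum(1 for a, b in zip(labels, labels[1:]) if a != b)


def expected_runs(m, n):
    """Expected number of runs under the null hypothesis."""
    return Fraction(2 * m * n + m + n, m + n)


# Vertices of the path in Figure 1 (sloping segments at the ties 4 and 7).
path = [(0, 0), (1, 0), (4, 1), (6, 1), (7, 2), (7, 5), (8, 5), (8, 6)]
print(ks_from_path(path))  # (Fraction(7, 12), Fraction(0, 1))

# The tie at 7 broken with the X first: the path passes through (7, 1).
path_x_first = [(0, 0), (1, 0), (4, 1), (7, 1), (7, 5), (8, 5), (8, 6)]
print(ks_from_path(path_x_first)[0])  # 17/24

# Combined sample with the tie at 7 broken each way.
print(count_runs("XXXXYXXXYYYYXY"))  # 6 (X before Y at the value 7)
print(count_runs("XXXXYXXYXYYYXY"))  # 8 (Y before X at the value 7)
print(expected_runs(8, 6))  # 55/7
```

Exact fractions are used so that the results can be compared directly with the values 7/12 and 17/24 quoted in the text. How the tie at the value 4 is broken does not affect the run count in this example.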

REFERENCES

[1] E. F. Drion, "Some distribution-free tests for the difference between two empirical cumulative distribution functions", Annals of Mathematical Statistics 23 (1952).
[2] F. J. Massey, Jr., "Distribution table for the deviation between two sample cumulatives", Annals of Mathematical Statistics 23 (1952).
[3] Roy C. Milton, "An extended table of critical values for the Mann-Whitney (Wilcoxon) two-sample statistic", Journal of the American Statistical Association 59 (1964).
[4] D. B. Owen, Handbook of Statistical Tables, Addison-Wesley (1962).
[5] Sidney Siegel, Nonparametric Statistics for the Behavioral Sciences, McGraw-Hill (1956).
[6] A. Wald and J. Wolfowitz, "On a test whether two samples are from the same population", Annals of Mathematical Statistics 11 (1940).
[7] M. B. Wilk and R. Gnanadesikan, "Probability plotting methods for the analysis of data", invited paper presented at joint meeting of American Statistical Association, Biometric Society (ENAR), and Institute of Mathematical Statistics, Philadelphia, September
