5 Introduction to the Theory of Order Statistics and Rank Statistics


Jade Margery Bradley
6 years ago

This section summarizes important definitions and theorems that are useful for understanding the theory of order and rank statistics. In particular, results are presented for linear rank statistics. Many nonparametric tests are based on test statistics that are linear rank statistics.

For one sample: The Wilcoxon Signed Rank Test is based on a linear rank statistic.

For two samples: The Mann-Whitney-Wilcoxon Test, the Median Test, the Ansari-Bradley Test, and the Siegel-Tukey Test are based on linear rank statistics.

Most of the information in this section can be found in Randles and Wolfe (1979).

5.1 Order Statistics

Let $X_1, X_2, \ldots, X_n$ be a random sample of continuous random variables having cdf $F(x)$ and pdf $f(x)$. Let $X_{(i)}$ be the $i$th smallest random variable ($i = 1, 2, \ldots, n$). $X_{(1)}, X_{(2)}, \ldots, X_{(n)}$ are referred to as the order statistics for $X_1, X_2, \ldots, X_n$. By definition, $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$.

Theorem 5.1: Let $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ be the order statistics for a random sample from a distribution with cdf $F(x)$ and pdf $f(x)$. The joint density for the order statistics is

$$g(x_{(1)}, x_{(2)}, \ldots, x_{(n)}) = \begin{cases} n! \prod_{i=1}^{n} f(x_{(i)}) & \text{for } -\infty < x_{(1)} < x_{(2)} < \cdots < x_{(n)} < \infty \\ 0 & \text{otherwise.} \end{cases}$$

Theorem 5.2: The marginal density for the $j$th order statistic $X_{(j)}$ ($j = 1, 2, \ldots, n$) is

$$g_j(t) = \frac{n!}{(j-1)!\,(n-j)!}\, [F(t)]^{j-1}\, [1 - F(t)]^{n-j}\, f(t), \qquad -\infty < t < \infty.$$

For a random variable $X$ with cdf $F(x)$, the inverse distribution function $F^{-1}(\cdot)$ is defined as

$$F^{-1}(y) = \inf\{x : F(x) \ge y\}, \qquad 0 < y < 1.$$

If $F(x)$ is strictly increasing between 0 and 1, then there is only one $x$ such that $F(x) = y$. In this case, $F^{-1}(y) = x$.

Theorem 5.3 (Probability Integral Transformation): Let $X$ be a continuous random variable with distribution function $F(x)$. The random variable $Y = F(X)$ is uniformly distributed on $(0, 1)$.

Let $X_{(1)} < X_{(2)} < \cdots < X_{(n)}$ be the order statistics for a random sample from a continuous distribution. Applying Theorem 5.3 shows that $F(X_{(1)}) < F(X_{(2)}) < \cdots < F(X_{(n)})$ are distributed as the order statistics from a uniform distribution on $(0, 1)$.
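As a quick numerical sanity check on the marginal density of the $j$th order statistic, the following R sketch codes the density formula directly and confirms that it integrates to 1. The function name `order_stat_density` and the choice of a standard normal parent with $j = 2$, $n = 5$ are illustrative assumptions, not part of the notes.

```r
# Marginal density of the j-th order statistic, coded from the formula
# n! / ((j-1)! (n-j)!) * F(t)^(j-1) * (1 - F(t))^(n-j) * f(t).
# `cdf` and `pdf` are functions for the parent distribution.
order_stat_density <- function(t, j, n, cdf, pdf) {
  const <- factorial(n) / (factorial(j - 1) * factorial(n - j))
  const * cdf(t)^(j - 1) * (1 - cdf(t))^(n - j) * pdf(t)
}

# Hypothetical example: 2nd smallest of n = 5 standard normal observations.
total <- integrate(order_stat_density, -Inf, Inf,
                   j = 2, n = 5, cdf = pnorm, pdf = dnorm)$value
print(total)  # a valid density should integrate to (approximately) 1
```

Passing `cdf` and `pdf` as arguments keeps the sketch usable for any continuous parent distribution available in R.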

Let $V_j = F(X_{(j)})$ for $j = 1, 2, \ldots, n$. Then, by Theorem 5.2, the marginal density for each $V_j$ has the form

$$g_j(t) = \frac{n!}{(j-1)!\,(n-j)!}\, t^{j-1}\, [1 - t]^{n-j}, \qquad 0 < t < 1,$$

because $F(t) = t$ and $f(t) = 1$ for a uniform distribution on $(0, 1)$. Thus, $V_j$ has a beta distribution with parameters $\alpha = j$ and $\beta = n - j + 1$. Therefore, the moments of $V_j$ are

$$E(V_j^r) = \frac{n!\,\Gamma(r + j)}{(j-1)!\,\Gamma(n + r + 1)}$$

where $\Gamma(k) = (k-1)!$. Thus, when $V_j$ is the $j$th order statistic from a uniform distribution,

$$E(V_j) = \frac{j}{n+1}, \qquad Var(V_j) = \frac{j(n - j + 1)}{(n+1)^2 (n+2)}.$$

Simulation to Demonstrate Theorem 5.3 (Probability Integral Transformation)

Case 1: N(0, 1) Distribution
1. Generate a random sample $(x_1, x_2, \ldots, x_{5000})$ of 5000 values from a normal N(0, 1) distribution.
2. Determine the 5000 cdf values $F(x_i)$.
3. Plot the histogram and empirical cdf of the original N(0, 1) sample. Note how they represent a sample from a standard normal distribution.
4. Plot the histogram and empirical cdf of the $F(x_i)$ values. Note that the histogram and empirical cdf of the $F(x_i)$ values represent a sample from a uniform U(0, 1) distribution (as supported by Theorem 5.3).

Case 2: Exp(4) Distribution
1. Generate a random sample $(x_1, x_2, \ldots, x_{5000})$ of 5000 values from an exponential Exp(4) distribution.
2. Determine the 5000 cdf values $F(x_i)$.
3. Plot the histogram and empirical cdf of the original Exp(4) sample. Note how they represent a sample from an exponential Exp(4) distribution.
4. Plot the histogram and empirical cdf of the $F(x_i)$ values. Note that the histogram and empirical cdf of the $F(x_i)$ values represent a sample from a uniform U(0, 1) distribution (as supported by Theorem 5.3).
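The moment formula for a uniform order statistic can be checked against the stated mean and variance. The sketch below is only an illustration: the function name `unif_order_moment` and the values $j = 3$, $n = 10$ are assumptions chosen for the example.

```r
# r-th moment of the j-th uniform order statistic, from the moment formula.
# gamma(k) equals (k - 1)! for positive integers k.
unif_order_moment <- function(r, j, n) {
  factorial(n) * gamma(r + j) / (factorial(j - 1) * gamma(n + r + 1))
}

j <- 3; n <- 10   # illustrative values, not from the notes

# Closed-form mean and variance stated in the text.
mean_formula <- j / (n + 1)
var_formula  <- j * (n - j + 1) / ((n + 1)^2 * (n + 2))

# The same quantities recovered from the first two moments.
m1 <- unif_order_moment(1, j, n)
m2 <- unif_order_moment(2, j, n)
var_from_moments <- m2 - m1^2

# Independent numerical check using the Beta(j, n - j + 1) density.
m1_numeric <- integrate(function(t) t * dbeta(t, j, n - j + 1), 0, 1)$value

print(c(mean_formula, m1, m1_numeric))
print(c(var_formula, var_from_moments))
```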

[Figure: two sets of four plots, one set per case. Case 1 panels: "Histogram of N(0,1) Sample" (Frequency vs. x), "Histogram of CDF of N(0,1) Sample)" (Frequency vs. Fx), "ECDF of N(0,1) Sample" (Fn(x) vs. x), and "ECDF(ECDF of N(0,1) Sample)" (Fn(x) vs. x). Case 2 panels: the same four plots for the Exp(4) sample.]

R Code for Simulation of Theorem 5.3 (Probability Integral Transformation)

    n <- 5000                 # size of random sample

    # CASE 1: Random Samples from N(0,1) Distribution
    x <- rnorm(n, 0, 1)
    x[1:10]                   # view first 10 values
    Fx <- pnorm(x)            # cdf values F(x_i)
    Fx[1:10]

    dev.new()                 # open a new graphics window (portable form of windows())
    par(mfrow = c(2, 2))
    hist(x, main = "Histogram of N(0,1) Sample")
    hist(Fx, main = "Histogram of CDF of N(0,1) Sample)")
    plot(ecdf(x), main = "ECDF of N(0,1) Sample")
    plot(ecdf(Fx), main = "ECDF(ECDF of N(0,1) Sample)")

    # CASE 2: Random Samples from Exponential(4) Distribution
    x <- rexp(n, 4)
    x[1:10]                   # view first 10 values
    Fx <- pexp(x, 4)          # cdf values F(x_i)
    Fx[1:10]

    dev.new()
    par(mfrow = c(2, 2))
    hist(x, main = "Histogram of Exp(4) Sample")
    hist(Fx, main = "Histogram of CDF of Exp(4) Sample)")
    plot(ecdf(x), main = "ECDF of Exp(4) Sample")
    plot(ecdf(Fx), main = "ECDF(ECDF of Exp(4) Sample)")

5.2 Equal-in-Distribution Results

Two random variables $S$ and $T$ are equal in distribution if $S$ and $T$ have the same cdf. To denote equality in distribution, we write $S \stackrel{d}{=} T$.

Theorem 5.4: A random variable $X$ has a distribution that is symmetric about some number $\mu$ if and only if $(X - \mu) \stackrel{d}{=} (\mu - X)$.

Theorem 5.5: Let $X_1, X_2, \ldots, X_n$ be independent and identically distributed (i.i.d.) random variables. Let $(\alpha_1, \alpha_2, \ldots, \alpha_n)$ denote any permutation of the integers $(1, 2, \ldots, n)$. Then $(X_1, X_2, \ldots, X_n) \stackrel{d}{=} (X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_n})$.

A set of random variables $X_1, X_2, \ldots, X_n$ is exchangeable if for every permutation $(\alpha_1, \alpha_2, \ldots, \alpha_n)$ of the integers $1, 2, \ldots, n$,

$$(X_1, X_2, \ldots, X_n) \stackrel{d}{=} (X_{\alpha_1}, X_{\alpha_2}, \ldots, X_{\alpha_n}).$$

If $X_1, X_2, \ldots, X_n$ are i.i.d. random variables, then the set $X_1, X_2, \ldots, X_n$ is exchangeable.

The statistic $t(\cdot)$ is
1. a translation statistic if $t(x_1 + k, x_2 + k, \ldots, x_n + k) = t(x_1, x_2, \ldots, x_n) + k$
2. a translation-invariant statistic if $t(x_1 + k, x_2 + k, \ldots, x_n + k) = t(x_1, x_2, \ldots, x_n)$
for every $k$ and $x_1, x_2, \ldots, x_n$.
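The two definitions of translation behavior can be spot-checked numerically. The sketch below tests one data vector and one shift, so it illustrates the definitions without proving them for every $k$. The helper names `is_translation` and `is_translation_invariant` and the data are hypothetical.

```r
x <- c(2.3, 4.1, 5.0, 7.7)   # made-up data
k <- 10                      # an arbitrary shift

# Spot check of t(x + k) == t(x) + k for a statistic `stat`.
is_translation <- function(stat, x, k) {
  isTRUE(all.equal(stat(x + k), stat(x) + k))
}

# Spot check of t(x + k) == t(x) for a statistic `stat`.
is_translation_invariant <- function(stat, x, k) {
  isTRUE(all.equal(stat(x + k), stat(x)))
}

sample_range <- function(v) max(v) - min(v)

is_translation(mean, x, k)                    # location statistics shift with the data
is_translation(median, x, k)
is_translation_invariant(sd, x, k)            # spread statistics ignore the shift
is_translation_invariant(sample_range, x, k)
```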

5.3 Ranking Statistics

Let $Z_1, Z_2, \ldots, Z_n$ be a random sample from a continuous distribution with cdf $F(z)$, and let $Z_{(1)} < Z_{(2)} < \cdots < Z_{(n)}$ be the corresponding order statistics.

$Z_i$ has rank $R_i$ among $Z_1, Z_2, \ldots, Z_n$ if $Z_i = Z_{(R_i)}$, assuming the $R_i$th order statistic is uniquely defined.

By uniquely defined we are assuming that ties are not possible. That is, $Z_{(i)} \ne Z_{(j)}$ for all $i \ne j$.

Let $\mathcal{R} = \{r : r \text{ is a permutation of the integers } (1, 2, \ldots, n)\}$. That is, $\mathcal{R}$ is the set of all permutations of the integers $(1, 2, \ldots, n)$.

Theorem 5.6: Let $R = (R_1, R_2, \ldots, R_n)$ be the vector of ranks where $R_i$ is the rank of $Z_i$ among $Z_1, Z_2, \ldots, Z_n$. Then $R$ is uniformly distributed over $\mathcal{R}$. That is, $P(R = r) = 1/n!$ for each permutation $r$.

Theorem 5.7: Let $Z_1, Z_2, \ldots, Z_n$ be a random sample from a continuous distribution, and let $R$ be the corresponding vector of ranks where $R_i$ is the rank of $Z_i$ for $i = 1, 2, \ldots, n$. Then

$$P[R_i = r] = \begin{cases} 1/n & \text{for } r = 1, 2, \ldots, n \\ 0 & \text{otherwise} \end{cases}$$

and, for $i \ne j$,

$$P[R_i = r, R_j = s] = \begin{cases} \dfrac{1}{n(n-1)} & \text{for } r \ne s;\ r, s = 1, 2, \ldots, n \\ 0 & \text{otherwise.} \end{cases}$$

Corollary 5.8: Let $R$ be the vector of ranks corresponding to a random sample from a continuous distribution. Then

$$E[R_i] = \frac{n+1}{2} \quad \text{and} \quad Var[R_i] = \frac{(n+1)(n-1)}{12} \quad \text{for } i = 1, 2, \ldots, n,$$

$$Cov[R_i, R_j] = -\frac{n+1}{12} \quad \text{for } i \ne j.$$

Let $V_1, V_2, \ldots, V_n$ be random variables with joint distribution function $D$, where $D$ is a member of some collection $\mathcal{A}$ of possible joint distributions. Let $T(V_1, V_2, \ldots, V_n)$ be a statistic based on $V_1, V_2, \ldots, V_n$. The statistic $T$ is distribution-free over $\mathcal{A}$ if the distribution of $T$ is the same for every joint distribution in $\mathcal{A}$.

Corollary 5.9: Let $Z_1, Z_2, \ldots, Z_n$ be a random sample from a continuous distribution, and let $R$ be the corresponding vector of ranks. If $V(R)$ is a statistic based only on $R$, then $V(R)$ is distribution-free over the class $\mathcal{A}$ of joint distributions of $n$ i.i.d. continuous random variables.

A statistic (such as $V(R)$) that is a function of $Z_1, Z_2, \ldots, Z_n$ only through the rank vector $R$ is called a rank statistic.
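Because the rank vector is uniform over all permutations, the rank moments can be verified by brute force for a small $n$. The sketch below enumerates all $4! = 24$ permutations and averages over them. The helper `all_perms` is a hypothetical utility written for this example, and $n = 4$ is an arbitrary choice.

```r
# All permutations of the vector v, one per row (simple recursive helper).
all_perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  out <- NULL
  for (i in seq_along(v)) {
    out <- rbind(out, cbind(v[i], all_perms(v[-i])))
  }
  out
}

n <- 4
P <- all_perms(seq_len(n))   # each row has probability 1/n! under uniformity

r1 <- P[, 1]                 # values of R_1 across all permutations
r2 <- P[, 2]                 # values of R_2 across all permutations

e_r1   <- mean(r1)                               # E[R_1]
var_r1 <- mean((r1 - e_r1)^2)                    # Var[R_1]
cov_12 <- mean((r1 - e_r1) * (r2 - mean(r2)))    # Cov[R_1, R_2]

print(c(e_r1, (n + 1) / 2))
print(c(var_r1, (n + 1) * (n - 1) / 12))
print(c(cov_12, -(n + 1) / 12))
```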

Example of a distribution-free statistic:

Let $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ be independent random samples from continuous distributions with cdfs $F(x)$ and $G(x) = F(x - \Delta)$, respectively ($-\infty < \Delta < \infty$). That is, $\Delta$ is a shift parameter.

Combine the $X$ and $Y$ samples. Let $R_i$ ($i = 1, 2, \ldots, n$) and $Q_j$ ($j = 1, 2, \ldots, m$) be the ranks of the $n$ $X$-values and the $m$ $Y$-values in the combined sample. Thus, $R_i$ and $Q_j$ take on values $1, 2, \ldots, (m + n)$.

Thus, the rank vector $R = (R_1, R_2, \ldots, R_n, Q_1, Q_2, \ldots, Q_m)$ is simply a permutation of the integers $(1, 2, \ldots, (m + n))$, which satisfies the constraint

$$\sum_{i=1}^{n} R_i + \sum_{j=1}^{m} Q_j = \sum_{k=1}^{m+n} k = \frac{(m+n)(m+n+1)}{2}.$$

To construct a test for $H_0 : \Delta = 0$ vs $H_1 : \Delta > 0$ based on the ranks in rank vector $R$, we compare the $X$-ranks $(R_1, R_2, \ldots, R_n)$ to the $Y$-ranks $(Q_1, Q_2, \ldots, Q_m)$.

If we know the $X$-ranks $(R_1, R_2, \ldots, R_n)$, then we also know the $Y$-ranks. Thus, it is sufficient to consider a statistic based only on the $X$-ranks, say $W(R_1, R_2, \ldots, R_n)$.

The test statistic proposed by Wilcoxon is $W = \sum_{i=1}^{n} R_i$. That is, $W$ is the sum of the $X$-ranks. $W$ is known as a ranksum statistic.

Note that the statistic $W$ is a function of the data only through the rank vector $R = (R_1, R_2, \ldots, R_n, Q_1, Q_2, \ldots, Q_m)$. That is, once we have $R$, we no longer need $(X_1, X_2, \ldots, X_n, Y_1, Y_2, \ldots, Y_m)$ to calculate $W$.

If $H_0 : \Delta = 0$ is true, then the data $X_1, X_2, \ldots, X_n, Y_1, Y_2, \ldots, Y_m$ are i.i.d. continuous random variables. Applying Corollary 5.9, the rank statistic $W$ is distribution-free over the class $\mathcal{A}$ of all continuous distributions. That is, for any continuous cdf $F \in \mathcal{A}$, the distribution of $W$ does not depend on the choice of $F$.

Theorem 5.10: Let $W$ be the rank sum statistic when $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_m$ are independent random samples from $F(x)$ and $G(y) = F(y - \Delta)$, respectively. If $H_0 : \Delta = 0$ is true, then the discrete distribution of $W$ is given by

$$P_0[W = w] = \begin{cases} \dfrac{t_{m,n}(w)}{\binom{m+n}{n}} & \text{for } w = \dfrac{n(n+1)}{2},\ \dfrac{n(n+1)}{2} + 1,\ \ldots,\ \dfrac{n(2m+n+1)}{2} \\ 0 & \text{otherwise} \end{cases}$$

where $t_{m,n}(w)$ is the number of subsets of $n$ integers selected without replacement from $(1, 2, \ldots, (m + n))$ such that their sum equals $w$.

Thus, to calculate $P_0[W = w]$ for a given $m$ and $n$, we need to (i) generate all $\binom{m+n}{n}$ possible assignments of $(m + n)$ ranks to the $X$ and $Y$ observations, (ii) calculate $W$ for each assignment, and (iii) count the number of cases where $W = w$.

For example, consider the case with $n = 2$ and $m = 4$. There are $\binom{6}{2} = 15$ possible assignments. Thus, there will be two $X$-ranks $(R_1, R_2)$ from the six possible ranks $(1, 2, 3, 4, 5, 6)$. $W = R_1 + R_2$ is then calculated for all possible assignments of the 6 ranks.
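Steps (i) to (iii) can be sketched in a few lines of R using `combn`, which enumerates every choice of $n$ ranks out of $m + n$. The function name `ranksum_null` is an assumption made for this illustration, and the approach is only practical for small samples.

```r
# Null distribution of the rank sum W by complete enumeration.
# n = number of X observations, m = number of Y observations.
ranksum_null <- function(n, m) {
  # (i) and (ii): W for every possible set of X-ranks
  sums <- combn(m + n, n, FUN = sum)
  # (iii): count each value of W and divide by the number of assignments
  table(sums) / choose(m + n, n)
}

dist <- ranksum_null(n = 2, m = 4)
print(dist)

# Support runs from n(n+1)/2 to n(2m+n+1)/2.
print(range(as.numeric(names(dist))))
```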

The following table shows the 15 assignments of the 6 ranks and the corresponding $W$ statistic values.

    X-ranks   Y-ranks                            X-ranks   Y-ranks
    R1,R2     Q1,Q2,Q3,Q4   W = R1 + R2          R1,R2     Q1,Q2,Q3,Q4   W = R1 + R2
    5,6       1,2,3,4       11                   2,4       1,3,5,6       6
    4,6       1,2,3,5       10                   2,3       1,4,5,6       5
    4,5       1,2,3,6        9                   1,6       2,3,4,5       7
    3,6       1,2,4,5        9                   1,5       2,3,4,6       6
    3,5       1,2,4,6        8                   1,4       2,3,5,6       5
    3,4       1,2,5,6        7                   1,3       2,4,5,6       4
    2,6       1,3,4,5        8                   1,2       3,4,5,6       3
    2,5       1,3,4,6        7

For each of the 15 unordered assignments of ranks within samples, there are $4!\,2! = 48$ ordered assignments yielding the same $W$ value. Thus, overall there are $6! = 720 = (15)(48)$ ordered assignments of the 6 ranks.

The distribution of $W$ is

    w            3      4      5      6      7      8      9      10     11
    P0[W = w]    1/15   1/15   2/15   2/15   3/15   2/15   2/15   1/15   1/15

Suppose that $W = 9$. Then for the test of $H_0 : \Delta = 0$ vs $H_1 : \Delta > 0$:

p-value = the probability of getting a test statistic $W$ that is at least 9 $= 2/15 + 1/15 + 1/15 = 4/15 \approx 0.267$.

Note that $w \in \{3, 4, \ldots, 11\} = \left\{ \frac{n(n+1)}{2},\ \frac{n(n+1)}{2} + 1,\ \ldots,\ \frac{n(2m+n+1)}{2} \right\}$ as stated in Theorem 5.10.

Theorem 5.11: Let $W = \sum_{j=1}^{n} R_j$ be the ranksum statistic. If $H_0 : \Delta = 0$ is true (i.e., $F = G$), then the distribution of $W$ is symmetric about the value $\mu = n(m + n + 1)/2$ and

$$E_0[W] = \mu, \qquad Var_0[W] = \frac{mn(m + n + 1)}{12}.$$

Statistics Based on Counting and Ranking

Let $X_1, X_2, \ldots, X_n$ be a random sample from a continuous distribution that is symmetric about the value $\mu$. Let $(Z_1, Z_2, \ldots, Z_n) = (X_1 - \mu, X_2 - \mu, \ldots, X_n - \mu)$. Then $Z_1, Z_2, \ldots, Z_n$ is a random sample from a distribution that is symmetric about 0.

Define $\Psi_i = \Psi(Z_i)$ to be an indicator variable where

$$\Psi(t) = \begin{cases} 1 & \text{if } t > 0 \\ 0 & \text{if } t \le 0. \end{cases}$$

Lemma 5.12: Let $Z$ be a random variable that is symmetrically distributed about 0. Then the random variables $|Z|$ and $\Psi = \Psi(Z)$ are stochastically independent. That is,

$$P(\Psi = 1, |Z| \le t) = P(\Psi = 1)\,P(|Z| \le t) \quad \text{and} \quad P(\Psi = 0, |Z| \le t) = P(\Psi = 0)\,P(|Z| \le t).$$

For random variables $Z_1, Z_2, \ldots, Z_n$, the absolute rank of $Z_i$, denoted $R_i^+$, is the rank of $|Z_i|$ among $|Z_1|, |Z_2|, \ldots, |Z_n|$.

The signed rank of $Z_i$ is $\Psi_i R_i^+$. Thus, (i) $\Psi_i R_i^+ = R_i^+$ if $Z_i > 0$ and (ii) $\Psi_i R_i^+ = 0$ if $Z_i \le 0$.

A signed rank statistic is a statistic that is a function of $\Psi_1 R_1^+, \Psi_2 R_2^+, \ldots, \Psi_n R_n^+$.

The following theorem establishes properties of the joint distribution of $\Psi = (\Psi_1, \Psi_2, \ldots, \Psi_n)$ and $R^+ = (R_1^+, R_2^+, \ldots, R_n^+)$.

Theorem 5.13: Let $Z_1, Z_2, \ldots, Z_n$ be a random sample from a continuous distribution that is symmetric about 0. Then $\Psi_1, \Psi_2, \ldots, \Psi_n, R^+$ are mutually independent. Moreover, each $\Psi_i$ is a Bernoulli random variable with $p = 1/2$, and $R^+$ is uniformly distributed over $\mathcal{R}$ (the set of all permutations of the integers $(1, 2, \ldots, n)$).

Proof of Theorem 5.13:
- $Z_1, Z_2, \ldots, Z_n$ are independent because they are a random sample. Lemma 5.12 implies that $\Psi_1, |Z_1|, \Psi_2, |Z_2|, \ldots, \Psi_n, |Z_n|$ are $2n$ mutually independent random variables.
- Each $\Psi_i$ is a Bernoulli random variable with parameter $p = P[Z_i > 0] = 1/2$ because $Z_i$ is continuous and symmetrically distributed about 0.
- $R^+$ is independent of $\Psi_1, \Psi_2, \ldots, \Psi_n$ because it is a function only of $|Z_1|, |Z_2|, \ldots, |Z_n|$. That is, $R^+$ does not depend on any $\Psi_i$.
- Because $R^+$ is the rank vector of the $n$ i.i.d. continuous random variables $|Z_1|, |Z_2|, \ldots, |Z_n|$, application of Theorem 5.6 shows that $R^+$ is uniformly distributed over $\mathcal{R}$ (the set of permutations of the integers $(1, 2, \ldots, n)$).

Let $\mathcal{A}_0$ be the set of joint distributions of $n$ i.i.d. continuous random variables that are symmetrically distributed about 0.

Corollary 5.14: Let $S(\Psi, R^+)$ be a statistic that depends on $Z_1, Z_2, \ldots, Z_n$ only through $\Psi = (\Psi_1, \Psi_2, \ldots, \Psi_n)$ and $R^+ = (R_1^+, R_2^+, \ldots, R_n^+)$. Then the statistic $S(\cdot)$ is distribution-free over $\mathcal{A}_0$.

Proof of Corollary 5.14: This result follows from Theorem 5.13 because $\Psi$ and $R^+$ have the same joint distribution for every joint distribution $F_0(Z_1, Z_2, \ldots, Z_n) \in \mathcal{A}_0$. That is, the joint distribution of $\Psi$ and $R^+$ does not depend on the choice of $F_0(Z_1, Z_2, \ldots, Z_n) \in \mathcal{A}_0$.

We will often be interested in functions of $\Psi$ and $R^+$ that are symmetric functions of the signed ranks $\Psi_1 R_1^+, \Psi_2 R_2^+, \ldots, \Psi_n R_n^+$. If this is the case, then the following theorem can help establish the distribution of such a statistic.
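To make the definitions of the indicator, the absolute ranks, and the signed ranks concrete, here is a minimal R sketch. The function name `signed_ranks` and the data vector are made up for illustration, and the code assumes there are no ties among the absolute values.

```r
# Indicators, absolute ranks, and signed ranks for centered data z.
signed_ranks <- function(z) {
  psi    <- as.integer(z > 0)   # Psi(z) = 1 if z > 0, else 0
  r_plus <- rank(abs(z))        # rank of |z_i| among |z_1|, ..., |z_n|
  list(psi = psi, r_plus = r_plus, signed = psi * r_plus)
}

z <- c(-1.2, 0.4, 2.5, -0.3, 0.9)   # made-up deviations from the center
out <- signed_ranks(z)

print(out$psi)      # which observations are positive
print(out$r_plus)   # absolute ranks
print(out$signed)   # signed ranks: R_i^+ where z_i > 0, and 0 elsewhere
```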

Theorem 5.15: Let $Z_1, Z_2, \ldots, Z_n$ be a random sample from a continuous distribution that is symmetric about 0. Let $Q$ be the number of positive $Z$s. For $Q = q$, let $S_1 < S_2 < \cdots < S_q$ denote the ordered absolute ranks of those $Z$s that are positive (i.e., $S_1 < S_2 < \cdots < S_q$ are the positive signed ranks in numerical order). Then

$$P[Q = q, S_1 = s_1, S_2 = s_2, \ldots, S_q = s_q] = \begin{cases} (1/2)^n & \text{for } q = 0, 1, \ldots, n \text{ and each } q\text{-tuple } (s_1, s_2, \ldots, s_q) \text{ of integers with } 1 \le s_1 < s_2 < \cdots < s_q \le n \\ 0 & \text{otherwise.} \end{cases}$$

Recall: Suppose $X_1, X_2, \ldots, X_n$ is a random sample from a continuous distribution that is symmetric about $\mu$. Then $(Z_1, Z_2, \ldots, Z_n) = (X_1 - \mu, X_2 - \mu, \ldots, X_n - \mu)$ is a random sample from a distribution that is symmetric about 0. Thus, all of the preceding results also apply to the $(X_i - \mu)$ random variables. That is, we can generalize the results to $\mathcal{A}_\mu$ = the class of continuous distributions that are symmetric about $\mu$ for any $-\infty < \mu < \infty$.

Example: Suppose we have a random sample $X_1, X_2, \ldots, X_n$ from a distribution in $\mathcal{A}_\mu$. The Wilcoxon signed rank statistic $W^+$ is defined as $W^+ = \sum_{i=1}^{n} \Psi_i R_i^+$. That is, $W^+$ is the sum of the signed ranks.

To test $H_0 : \mu = \mu_0$ vs $H_1 : \mu > \mu_0$, we would reject $H_0$ if $W^+$ is too large. That is, we would reject $H_0$ if the p-value is small (e.g., p-value < .05). So how do we calculate the p-value?

Corollary 5.16: Let $W^+$ be the Wilcoxon signed rank statistic for testing $H_0 : \mu = \mu_0$. For a random sample of size $n$, the distribution of $W^+$ assuming $H_0$ is true is

$$P_0[W^+ = k] = \begin{cases} \dfrac{c_n(k)}{2^n} & \text{for } k = 0, 1, \ldots, \dfrac{n(n+1)}{2} \\ 0 & \text{otherwise} \end{cases}$$

where $c_n(k)$ = the number of subsets of the integers $\{1, 2, \ldots, n\}$ for which $W^+$ is equal to $k$ (that is, whose elements sum to $k$).

Suppose $n = 4$. The following table lists the $2^4 = 16$ combinations of signed ranks and the corresponding $W^+$ values.

    Subset of {1,2,3,4}   W+        Subset of {1,2,3,4}   W+
    (empty set)           0         {2,3}                 5
    {1}                   1         {2,4}                 6
    {2}                   2         {3,4}                 7
    {3}                   3         {1,2,3}               6
    {4}                   4         {1,2,4}               7
    {1,2}                 3         {1,3,4}               8
    {1,3}                 4         {2,3,4}               9
    {1,4}                 5         {1,2,3,4}             10
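The subset table above can be generated mechanically: under $H_0$ each of the $2^n$ sign patterns is equally likely, so the null distribution of $W^+$ follows from enumerating them. The function name `signed_rank_null` is an assumption for this sketch, which is only feasible for small $n$.

```r
# Null distribution of the Wilcoxon signed rank statistic W+ by enumeration.
signed_rank_null <- function(n) {
  # Each row is one possible (Psi_1, ..., Psi_n); all 2^n rows are equally likely.
  signs <- as.matrix(expand.grid(rep(list(0:1), n)))
  # W+ is the sum of the ranks 1..n whose indicator equals 1.
  w_plus <- as.vector(signs %*% seq_len(n))
  table(w_plus) / 2^n
}

dist <- signed_rank_null(4)
print(dist)

# Upper-tail probability P0[W+ >= 8] for n = 4.
upper_tail <- sum(dist[as.numeric(names(dist)) >= 8])
print(upper_tail)
```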

Thus, the distribution of $W^+$ is

    k            0      1      2      3      4      5      6      7      8      9      10
    P[W+ = k]    1/16   1/16   1/16   2/16   2/16   2/16   2/16   2/16   1/16   1/16   1/16

Suppose the data are $(X_1, X_2, X_3, X_4) = (4.6, 5.1, 5.6, 5.7)$, and we want to test $H_0 : \mu = 5$ vs $H_1 : \mu > 5$.

Next calculate the deviations from $\mu_0 = 5$. That is, $(Z_1, Z_2, Z_3, Z_4) = (-.4, .1, .6, .7)$, and the vector of absolute values is $(|Z_1|, |Z_2|, |Z_3|, |Z_4|) = (.4, .1, .6, .7)$.

The absolute rank vector is $R^+ = (R_1^+, R_2^+, R_3^+, R_4^+) = (2, 1, 3, 4)$.

$\Psi_i = 1$ if $Z_i > 0$ (or equivalently, if $X_i > 5$), and is 0 otherwise. Thus, $(\Psi_1, \Psi_2, \Psi_3, \Psi_4) = (0, 1, 1, 1)$.

Therefore the signed rank statistic $W^+ = \sum_{i=1}^{4} \Psi_i R_i^+$ is $W^+ = (0)(2) + (1)(1) + (1)(3) + (1)(4) = 8$.

The p-value is the probability of getting a $W^+$ value that is at least 8. Therefore, the p-value $= P[W^+ = 8, 9, \text{ or } 10] = (1 + 1 + 1)/16 = 3/16 = .1875$.

Theorem 5.17: The distribution of the Wilcoxon signed rank statistic $W^+$ is symmetric about its mean $\mu_{W^+} = n(n+1)/4$ if $H_0 : \mu = \mu_0$ is true.

5.4 Linear Rank Statistics

Earlier we studied the ranksum statistic $W = \sum_{i=1}^{n} R_i$ where $R_i$ is the rank of $X_i$ among a combined sample $X_1, X_2, \ldots, X_n, Y_1, Y_2, \ldots, Y_m$.

If $H_0 : \Delta = 0$ is true, then the random variables $X_1, X_2, \ldots, X_n, Y_1, Y_2, \ldots, Y_m$ are i.i.d., and by Corollary 5.9, $W$ is distribution-free over the class of continuous distributions $\mathcal{A}$.

The test statistic $W$ has two important properties:
1. $W$ maintains the desired $\alpha$-level over a very broad class of distributions ($\mathcal{A}$).
2. The power of $W$ is excellent for detecting a shift for many distributions, especially for a medium-tailed distribution (such as the normal or logistic).

We now consider a general class of rank statistics (which includes $W$).

Let $R = (R_1, R_2, \ldots, R_N)$ be a vector of ranks. Let $a(1), a(2), \ldots, a(N)$ and $c(1), c(2), \ldots, c(N)$ be two sets of $N$ constants. A statistic of the form

$$S = \sum_{i=1}^{N} c(i)\, a(R_i)$$

is called a linear rank statistic.

The constants $a(1), a(2), \ldots, a(N)$ are called the scores, and $c(1), c(2), \ldots, c(N)$ are called the regression constants.

The choice of $c(1), c(2), \ldots, c(N)$ will depend on the specific testing problem of interest.
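Returning briefly to the signed rank example earlier on this page (the test of $H_0 : \mu = 5$), the arithmetic can be reproduced in R. This is a sketch under one assumption: the second data value is taken to be 5.1. The exact p-value uses the base R function `psignrank`, which gives the null cdf of $W^+$.

```r
x   <- c(4.6, 5.1, 5.6, 5.7)   # data; the second value (5.1) is an assumption
mu0 <- 5                       # hypothesized center

z      <- x - mu0              # deviations from mu0
psi    <- as.integer(z > 0)    # indicators Psi_i
r_plus <- rank(abs(z))         # absolute ranks R_i^+
w_plus <- sum(psi * r_plus)    # Wilcoxon signed rank statistic W+

# psignrank(q, n) returns P0[W+ <= q], so the upper tail P0[W+ >= w]
# is obtained from q = w - 1 with lower.tail = FALSE.
p_value <- psignrank(w_plus - 1, n = length(x), lower.tail = FALSE)

print(w_plus)
print(p_value)
```

The built-in `wilcox.test(x, mu = 5, alternative = "greater")` performs the same test and can serve as a cross-check.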

Case I: In two-sample problems $R$ is the rank vector of $X_1, X_2, \ldots, X_n, Y_1, Y_2, \ldots, Y_m$. In general, let $R_1, R_2, \ldots, R_n$ be the ranks of $X_1, X_2, \ldots, X_n$ and $R_{n+1}, R_{n+2}, \ldots, R_{m+n}$ be the ranks of $Y_1, Y_2, \ldots, Y_m$. If

$$c(i) = \begin{cases} 1 & \text{for } i = 1, 2, \ldots, n \\ 0 & \text{for } i = n+1, n+2, \ldots, m+n \end{cases} \qquad (7)$$

then $S = \sum_{i=1}^{m+n} c(i)\, a(R_i) = \sum_{i=1}^{n} a(R_i)$, which is the sum of the scores associated with the ranks of $X_1, X_2, \ldots, X_n$.

The constants $c(i)$ in (7) are called two-sample regression constants.

Case II: For Case I, if we also let $a(i) = i$ for $i = 1, 2, \ldots, m + n$, then $S = \sum_{i=1}^{n} R_i$, which is the ranksum statistic $W$. The scores $a(i) = i$ are called the Wilcoxon scores.

Case III: It is clear that a different choice of $a(1), a(2), \ldots, a(N)$ scores for the two-sample problem will yield a test statistic with different properties. Let $M$ = the median of the combined sample $X_1, X_2, \ldots, X_n, Y_1, Y_2, \ldots, Y_m$, and define

$$a(i) = \begin{cases} 0 & \text{if } i \le \dfrac{m+n+1}{2} \\ 1 & \text{if } i > \dfrac{m+n+1}{2} \end{cases} \qquad (8)$$

Consider $S$ with these $a(i)$ scores and the two-sample regression constants in Case I:

$$S = \sum_{i=1}^{n} a(R_i) = \text{the number of } X_i \text{ values larger than the sample median } M.$$

This $S$ is the linear rank statistic for the two-sample median test, and the scores in (8) are called the median scores.

5.4.1 Linear Rank Statistics under $H_0$

In this section, general properties of linear rank statistics will be studied under the null hypothesis, where null hypothesis refers to any set of assumptions that will result in the rank vector $R$ being uniformly distributed over $\mathcal{R}$ (the set of permutations of the integers $1, 2, \ldots, N$). In future sections, we will study the null hypothesis for specific testing problems.
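Cases I to III can be illustrated with one small function that evaluates a linear rank statistic for any scores and regression constants. The function name `linear_rank_stat` and the two tiny samples are hypothetical, and the code assumes no ties in the combined sample.

```r
# Linear rank statistic S = sum_i c(i) * a(R_i).
linear_rank_stat <- function(cvec, a, ranks) sum(cvec * a[ranks])

x <- c(3.1, 5.4, 2.2)          # made-up X sample (n = 3)
y <- c(4.0, 6.3, 1.5, 7.8)     # made-up Y sample (m = 4)
n <- length(x); m <- length(y); N <- n + m

ranks <- rank(c(x, y))         # ranks in the combined sample, X-values first

c_two      <- c(rep(1, n), rep(0, m))                # two-sample regression constants
a_wilcoxon <- seq_len(N)                             # Wilcoxon scores a(i) = i
a_median   <- as.integer(seq_len(N) > (N + 1) / 2)   # median scores

W     <- linear_rank_stat(c_two, a_wilcoxon, ranks)  # ranksum statistic
S_med <- linear_rank_stat(c_two, a_median, ranks)    # median test statistic

print(W)
print(S_med)
print(sum(x > median(c(x, y))))   # direct count of X-values above the combined median
```

Only the score vector changes between the ranksum and median statistics; the regression constants and ranks are shared.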

Lemma 5.18: Let $a(1), a(2), \ldots, a(N)$ be a set of $N$ constants. Then, if $R$ is uniformly distributed over the permutation set $\mathcal{R}$,

$$E[a(R_i)] = \frac{1}{N} \sum_{k=1}^{N} a(k) = \bar{a} \qquad \text{for } i = 1, 2, \ldots, N$$

$$Var[a(R_i)] = \frac{1}{N} \sum_{k=1}^{N} (a(k) - \bar{a})^2 \qquad \text{for } i = 1, 2, \ldots, N$$

$$Cov[a(R_i), a(R_j)] = \frac{-1}{N(N-1)} \sum_{k=1}^{N} (a(k) - \bar{a})^2 = \frac{-1}{N-1}\, Var[a(R_i)] \qquad \text{for } i \ne j.$$

The proof of Lemma 5.18 involves using Theorem 5.7 and the definitions of $E(\cdot)$, $Var(\cdot)$, and $Cov(\cdot, \cdot)$.

Lemma 5.18 is used to establish the mean and variance of a linear rank statistic under the null hypothesis.

Theorem 5.19: Let $S$ be a linear rank statistic with regression constants $c(1), c(2), \ldots, c(N)$ and scores $a(1), a(2), \ldots, a(N)$. If $R$ is uniformly distributed over $\mathcal{R}$, then

$$E[S] = N\, \bar{c}\, \bar{a} \quad \text{and} \quad Var[S] = \frac{1}{N-1} \left[ \sum_{i=1}^{N} (c(i) - \bar{c})^2 \right] \left[ \sum_{k=1}^{N} (a(k) - \bar{a})^2 \right]$$

where $\bar{a} = (1/N) \sum_{i=1}^{N} a(i)$ and $\bar{c} = (1/N) \sum_{i=1}^{N} c(i)$.

5.5 Asymptotic Normality of Rank Statistics (Supplemental)

The regression constants $c(1), c(2), \ldots, c(N)$ are determined by the problem of interest. Thus, we will only place a weak restriction on these constants. The restriction essentially requires that asymptotically no individual $c(i)$ value is much larger than the other constants. Specifically, the restriction is

$$\frac{\sum_{i=1}^{N} (c(i) - \bar{c})^2}{\max_{1 \le i \le N} (c(i) - \bar{c})^2} \to \infty \quad \text{as } N \to \infty \qquad (9)$$

where $\bar{c} = (1/N) \sum_{i=1}^{N} c(i)$. This is known as Noether's condition.

Let $\phi$ be a real-valued function defined on $(0, 1)$ that (i) does not depend on $N$, (ii) can be written as the difference $\phi = \phi_1 - \phi_2$ of two non-decreasing functions, and (iii) satisfies

$$0 < \int_0^1 \left[ \phi(u) - \bar{\phi} \right]^2 du < \infty \quad \text{with} \quad \bar{\phi} = \int_0^1 \phi(u)\, du.$$

A function $\phi(\cdot)$ with these properties is called a square integrable score function.
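The mean and variance formulas for a linear rank statistic under a uniformly distributed rank vector can be confirmed by brute force for small $N$. The sketch below enumerates all $4!$ permutations. The helper `all_perms` and the particular scores and regression constants are arbitrary choices made for illustration.

```r
# All permutations of the vector v, one per row (simple recursive helper).
all_perms <- function(v) {
  if (length(v) <= 1) return(matrix(v, nrow = 1))
  out <- NULL
  for (i in seq_along(v)) {
    out <- rbind(out, cbind(v[i], all_perms(v[-i])))
  }
  out
}

a    <- c(0.5, 2, 3.5, 7)   # arbitrary scores (illustrative)
cvec <- c(1, -1, 2, 0)      # arbitrary regression constants (illustrative)
N    <- length(a)

# Value of S = sum_i c(i) a(R_i) for every equally likely rank vector.
P <- all_perms(seq_len(N))
S <- apply(P, 1, function(r) sum(cvec * a[r]))

mean_enum <- mean(S)
var_enum  <- mean((S - mean_enum)^2)

# Closed forms: E[S] = N * cbar * abar and the product-of-sums variance.
mean_thm <- N * mean(cvec) * mean(a)
var_thm  <- sum((cvec - mean(cvec))^2) * sum((a - mean(a))^2) / (N - 1)

print(c(mean_enum, mean_thm))
print(c(var_enum, var_thm))
```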

For a square integrable score function,

$$\int_0^1 \left[ \phi(u) - \bar{\phi} \right]^2 du = \int_0^1 \phi^2(u)\, du - (\bar{\phi})^2.$$

Let $\phi$ be a square integrable score function and $a(1), a(2), \ldots, a(N)$ be scores that satisfy any of the following three conditions:

(A1) $a(i) = \phi\!\left( \dfrac{i}{N+1} \right)$.

(A2) $a(i) = N \displaystyle\int_{(i-1)/N}^{i/N} \phi(u)\, du$ for $i = 1, 2, \ldots, N$.

(A3) $a(i) = E[\phi(U_{(i)})]$ where $U_{(i)}$ is the $i$th order statistic from a random sample of size $N$ from a uniform $(0, 1)$ distribution.

Let $S = \sum_{i=1}^{N} c(i)\, a(R_i)$.

Let $S^+ = \sum_{i=1}^{N} c(i)\, \Psi_i\, a(R_i^+)$.

Theorem 5.20 (Asymptotic Normality of Linear Rank Statistics): Under $H_0$ for a linear rank statistic $S$, and assuming Noether's condition and condition A1, A2, or A3,

$$\frac{S - E(S)}{\sqrt{Var(S)}} \xrightarrow{d} N(0, 1) \quad \text{as } N \to \infty.$$

Theorem 5.21 (Asymptotic Normality of Signed Rank Statistics): Under $H_0$ for a signed rank statistic $S^+$, and assuming Noether's condition and condition A1, A2, or A3,

$$\frac{S^+ - E(S^+)}{\sqrt{Var(S^+)}} \xrightarrow{d} N(0, 1) \quad \text{as } N \to \infty.$$

The linear rank statistics and signed rank statistics discussed in this course all have asymptotic $N(0, 1)$ distributions after standardizing.
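One way to see the asymptotic normality at work is to compare the exact null cdf of the ranksum statistic $W$ with its normal approximation. The sketch below uses the base R function `pwilcox`, which is parameterized in terms of $U = W - n(n+1)/2$. The sample sizes $n = m = 20$ and the continuity correction of 0.5 are assumptions added for this illustration and are not part of the theorem.

```r
n <- 20; m <- 20                         # illustrative sample sizes

# Null mean and variance of the ranksum statistic W.
mu_w  <- n * (m + n + 1) / 2
var_w <- m * n * (m + n + 1) / 12

# Support of W: n(n+1)/2, ..., n(2m+n+1)/2.
w <- seq(n * (n + 1) / 2, n * (2 * m + n + 1) / 2)

# Exact null cdf; pwilcox works with U = W - n(n+1)/2.
exact_cdf <- pwilcox(w - n * (n + 1) / 2, n, m)

# Normal approximation with a continuity correction (an added assumption).
normal_cdf <- pnorm((w + 0.5 - mu_w) / sqrt(var_w))

max_gap <- max(abs(exact_cdf - normal_cdf))
print(max_gap)   # largest discrepancy between the exact and approximate cdfs
```

Rerunning the sketch with larger `n` and `m` should shrink `max_gap`, in line with the limiting result.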
