Confidence Regions For The Ratio Of Two Percentiles

1 Confidence Regions For The Ratio Of Two Percentiles Richard Johnson Joint work with Li-Fei Huang and Songyong Sim January 28, 2009

2 OUTLINE
  Introduction
  Exact sampling results: normal linear model case
  Other general distributions are treated by large sample methods.
  Nonparametric Bayes: expansion of posterior distribution of quantile.

3 Introduction Wood engineers use the ratio of the fifth or other lower percentiles when comparing lumber of two different grades, moisture content, or sorted under two different grading systems. Motivated by this application, we develop some exact, and some approximate, confidence regions for the ratio of percentiles from two different populations. This work generalizes earlier work on inferences concerning the ratio of two means.

4 Most of the existing literature on ratios deals with the ratio of means and, in particular, ratios of means of normal distributions. Fieller (1954) treats bivariate normal distributions. McDonald (1981) obtained confidence regions for the ratio of means in the normal theory setting of fitting a straight line.

5 Hwang (1995) used a resampling approach to construct confidence regions for this same ratio. Huang and Johnson (2004) study the two sample normal theory case. Evans, Johnson, Green and Verrill obtain large sample confidence intervals for the ratio of Weibull percentiles.

6 Fieller (1954), using a fiducial approach, considered
$$T(\theta) = \frac{\sqrt{n}\,(\bar{X} - \theta \bar{Y})}{\sqrt{S_{11} - 2\theta S_{12} + \theta^2 S_{22}}}$$
which has a central t distribution when $\theta = \mu_1/\mu_2$ prevails and the $(X_i, Y_i)$ are bivariate normal. A confidence region consists of all $\theta_0$ such that $H_0: \theta = \theta_0$ is not rejected by a two-sided test using the central t distribution. Unfortunately, $T(\theta)$ is not monotone in $\theta$. This sometimes results in a region that is the complement of a finite interval or even the whole real line.

7 Some exact results
Assumptions
  $X_1, \ldots, X_{n_1}$ is a random sample from the cumulative distribution function (cdf) $F_1(\cdot)$.
  $Y_1, \ldots, Y_{n_2}$ is a random sample from $F_2(\cdot)$.
  The two random samples are independent.
Initially we allow for different percentiles, $p_1$ and $p_2$, to be specified.
$$\xi_{1p_1} = \inf\{x : F_1(x) \ge p_1\}, \qquad \xi_{2p_2} = \inf\{y : F_2(y) \ge p_2\}$$
The ratio of population percentiles is $\theta = \xi_{1p_1}/\xi_{2p_2}$.

8 Normal distributions with equal variances
Further, let $F_i$ be $N(\mu_i, \sigma_i^2)$, for i = 1, 2. Then $\xi_{ip_i} = \mu_i + \sigma_i \Phi^{-1}(p_i)$ is the lower $100p_i$-th percentile, i = 1, 2, where $\Phi(\cdot)$ is the standard normal cdf. We also let $z_{p_i}$ be the upper $100p_i$-th percentile of the standard normal distribution, so $\xi_{ip_i} = \mu_i - z_{p_i}\sigma_i$ for i = 1, 2.

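9 $\theta = \xi_{1p_1}/\xi_{2p_2}$ with $\xi_{ip_i} = \mu_i - z_{p_i}\sigma_i$ for i = 1, 2.
Theorem If the value of $k = \sigma_1/\sigma_2$ is known,
$$T(\theta, k) = \frac{(\bar{X} - \theta\bar{Y})/k}{\sqrt{\dfrac{1}{n_1} + \dfrac{\theta^2}{n_2 k^2}}\;\sqrt{\dfrac{\frac{1}{k^2}\sum_{i=1}^{n_1}(X_i - \bar{X})^2 + \sum_{j=1}^{n_2}(Y_j - \bar{Y})^2}{n_1 + n_2 - 2}}}$$
follows the non-central t distribution with $n_1 + n_2 - 2$ degrees of freedom and non-centrality parameter
$$\delta(\theta, k) = \frac{z_{p_1} - \frac{\theta}{k} z_{p_2}}{\sqrt{\dfrac{1}{n_1} + \dfrac{\theta^2}{n_2 k^2}}}$$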

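10 Remark If the two variances are the same, that is, k = 1, then
$$T(\theta) = \frac{\bar{X} - \theta\bar{Y}}{\sqrt{\dfrac{1}{n_1} + \dfrac{\theta^2}{n_2}}\;\sqrt{\dfrac{\sum_{i=1}^{n_1}(X_i - \bar{X})^2 + \sum_{j=1}^{n_2}(Y_j - \bar{Y})^2}{n_1 + n_2 - 2}}}$$
follows the non-central t distribution with $n_1 + n_2 - 2$ degrees of freedom and non-centrality parameter
$$\delta(\theta) = \frac{z_{p_1} - \theta z_{p_2}}{\sqrt{\dfrac{1}{n_1} + \dfrac{\theta^2}{n_2}}}.$$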
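As a rough numerical check of the equal-variance statistic, the following Python sketch evaluates $T(\theta)$ and $\delta(\theta)$ from summary statistics. The function names and the choice to pass the pooled standard deviation directly are assumptions made for illustration, not part of the talk. When $p_1 = p_2 = 0.5$ the non-centrality parameter is zero, so the region endpoints should sit where $|T(\theta)|$ equals the central t critical value (about 2.101 for 18 degrees of freedom at the 95% level).

```python
import math
from statistics import NormalDist


def upper_z(p):
    """Upper 100p-th percentile z_p of the standard normal distribution."""
    return NormalDist().inv_cdf(1.0 - p)


def t_stat(theta, xbar, ybar, s_pooled, n1, n2):
    """T(theta) for k = 1; s_pooled is the pooled standard deviation
    sqrt((SSx + SSy) / (n1 + n2 - 2))."""
    scale = math.sqrt(1.0 / n1 + theta ** 2 / n2)
    return (xbar - theta * ybar) / (scale * s_pooled)


def noncentrality(theta, p1, p2, n1, n2):
    """delta(theta) for k = 1."""
    scale = math.sqrt(1.0 / n1 + theta ** 2 / n2)
    return (upper_z(p1) - theta * upper_z(p2)) / scale


# Illustrative inputs: xbar = 5, ybar = 10, pooled sd = 2, n1 = n2 = 10.
for theta in (0.359, 0.659):
    print(theta, t_stat(theta, 5.0, 10.0, 2.0, 10, 10))
```

With these inputs, the two values of $\theta$ are the endpoints reported for the ratio of means in the table of 95% regions, and the printed statistics should be close to plus and minus 2.10.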

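11 Proof Here $k = \sigma_1/\sigma_2$ and $\theta = \dfrac{\mu_1 - z_{p_1}\sigma_1}{\mu_2 - z_{p_2}\sigma_2}$. Under the assumptions, the random quantity $\dfrac{\bar{X} - \theta\bar{Y}}{k\sigma_2}$ is distributed as $N\!\left(0,\ \dfrac{1}{n_1} + \dfrac{\theta^2}{n_2 k^2}\right)$ plus the constant
$$\frac{\mu_1 - \mu_2\theta}{\sigma_1} = z_{p_1} - \frac{\theta}{k} z_{p_2},$$
and
$$\frac{\sum_{i=1}^{n_1}(X_i - \bar{X})^2}{\sigma_1^2} + \frac{\sum_{j=1}^{n_2}(Y_j - \bar{Y})^2}{\sigma_2^2} = \frac{\sum_{i=1}^{n_1}(X_i - \bar{X})^2}{k^2\sigma_2^2} + \frac{\sum_{j=1}^{n_2}(Y_j - \bar{Y})^2}{\sigma_2^2}$$
is independently distributed as $\chi^2_{n_1+n_2-2}(0)$.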

12 The result follows by the definition of the non-central t as a ratio of these random quantities (see Johnson and Kotz (1970), pp ). A $100(1-\alpha)\%$ confidence region for $\theta$ consists of all $\theta_0$ such that the equal-tailed level $\alpha$ test of $H_0: \theta = \theta_0$ versus $H_1: \theta \ne \theta_0$ does not reject $H_0$. The confidence region can be a finite interval, the complement of an interval, or the real line.

14 Table 1 illustrates possible confidence regions for the ratio of percentiles from two normal distributions having equal variances. Possible regions include a bounded interval, the complement of an interval, and the whole real line. The latter two cases occur when the estimated percentile in the denominator and both percentiles are too close to zero, respectively. Table 1 also shows that the regions for $\xi_{2,p_2}/\xi_{1,p_1}$ could be obtained from those for the ratio $\xi_{1,p_1}/\xi_{2,p_2}$.

15 95% confidence regions: normal populations with equal variances. n_1 = n_2 = 10, s_pooled = 2.

                                       x̄ = 5, ȳ = 10       x̄ = 10, ȳ = 5
means (p_1 = p_2 = 0.5)                (0.359, 0.659)      (1.518, 2.786)
20th percentiles (p_1 = p_2 = 0.2)     (0.206, 0.574)      (1.741, 4.870)
10th percentiles (p_1 = p_2 = 0.1)     (0.075, 0.515)      (1.939, )
5th percentiles (p_1 = p_2 = 0.05)     ( 0.087, 0.458)     (, ) (2.183, )

16 Large-Sample Confidence Regions
When large samples are available, we can obtain confidence intervals for the ratio of percentiles even when the ratio of variances, $k^2$, is unknown. The first result still requires normal populations. The $100p_i$-th percentile of population i is $\mu_i - z_{p_i}\sigma_i$ for i = 1, 2.

17 Let $X_1, \ldots, X_{n_1}$ be a random sample from a normal distribution with mean $\mu_1$ and variance $\sigma_1^2$; let $Y_1, \ldots, Y_{n_2}$ be a random sample from a normal distribution with mean $\mu_2$ and variance $\sigma_2^2$; and let the samples be independent. If
$$\lim_{n_1, n_2 \to \infty} \frac{n_1}{n_1 + n_2} = \lambda, \quad 0 < \lambda < 1, \quad \text{and} \quad \xi_{2p_2} \ne 0,$$
an application of the delta method gives an approximate $100(1-\alpha)\%$ confidence interval for $\theta$:
$$\frac{\bar{x} - z_{p_1}s_1}{\bar{y} - z_{p_2}s_2} \;\pm\; z_{\alpha/2}\sqrt{\left(1 + \frac{z_{p_1}^2}{2}\right)\frac{s_1^2}{n_1(\bar{y} - z_{p_2}s_2)^2} + \left(1 + \frac{z_{p_2}^2}{2}\right)\frac{s_2^2\,(\bar{x} - z_{p_1}s_1)^2}{n_2(\bar{y} - z_{p_2}s_2)^4}}$$
It always gives an interval.
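The delta-method interval can be evaluated directly from summary statistics. The sketch below is one way to do so; the function name, argument order and default alpha are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def delta_interval(xbar, s1, n1, ybar, s2, n2, p1, p2, alpha=0.05):
    """Approximate 100(1 - alpha)% delta-method interval for the ratio of
    the 100*p1-th percentile of population 1 to the 100*p2-th of population 2."""
    nd = NormalDist()
    zp1 = nd.inv_cdf(1.0 - p1)          # upper percentile z_{p1}
    zp2 = nd.inv_cdf(1.0 - p2)          # upper percentile z_{p2}
    z_crit = nd.inv_cdf(1.0 - alpha / 2.0)

    num = xbar - zp1 * s1               # estimated numerator percentile
    den = ybar - zp2 * s2               # estimated denominator percentile
    theta_hat = num / den

    var = ((1.0 + zp1 ** 2 / 2.0) * s1 ** 2 / (n1 * den ** 2)
           + (1.0 + zp2 ** 2 / 2.0) * s2 ** 2 * num ** 2 / (n2 * den ** 4))
    half_width = z_crit * math.sqrt(var)
    return theta_hat - half_width, theta_hat + half_width


# Illustrative inputs only: ratio of medians with xbar = 5, ybar = 10.
print(delta_interval(5.0, 2.0, 10, 10.0, 2.0, 10, 0.5, 0.5))
```

Because the interval is symmetric about the point estimate, it can never be unbounded, which is what "it always gives an interval" refers to. That says nothing about its coverage when the denominator percentile is near zero.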

18 An Alternative Large-Sample Approach
We begin by considering the random quantity $\bar{X} - z_{p_1}S_1 - \theta(\bar{Y} - z_{p_2}S_2)$ and derive large sample confidence regions that may be intervals, complements of intervals, or sometimes even the whole real line.

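19 Theorem When $\mu_1 - z_{p_1}\sigma_1 \ne 0$ and
$$\lim_{n_1, n_2 \to \infty} \frac{n_1}{n_1 + n_2} = \lambda, \quad 0 < \lambda < 1,$$
the quantity
$$\sqrt{n_1 + n_2}\,\Big(\bar{X} - z_{p_1}S_1 - \theta(\bar{Y} - z_{p_2}S_2) - (\xi_{1,p_1} - \theta\,\xi_{2,p_2})\Big)$$
converges in distribution to the normal with mean 0 and variance
$$\frac{1}{\lambda}\left(1 + \frac{z_{p_1}^2}{2}\right)\sigma_1^2 + \theta^2\,\frac{1}{1-\lambda}\left(1 + \frac{z_{p_2}^2}{2}\right)\sigma_2^2 .$$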

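20 Under the null hypothesis $H_0: \theta = \xi_{1,p_1}/\xi_{2,p_2}$, the test statistic
$$Z_{n_1,n_2}(\theta) = \frac{\bar{X} - z_{p_1}S_1 - \theta(\bar{Y} - z_{p_2}S_2)}{\sqrt{\left(1 + \dfrac{z_{p_1}^2}{2}\right)\dfrac{S_1^2}{n_1} + \theta^2\left(1 + \dfrac{z_{p_2}^2}{2}\right)\dfrac{S_2^2}{n_2}}} \qquad (1)$$
is, asymptotically, standard normal.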

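21 A large sample $100(1-\alpha)\%$ confidence region for $\theta$ is then given by the collection
$$\{\theta : |Z_{n_1,n_2}(\theta)| \le z_{\alpha/2}\}.$$
As a function of $\theta$, $Z_{n_1,n_2}(\theta)$ has a single maximum which is, in absolute value, greater than the asymptotes. Let $\theta^* = \arg\max_\theta |Z_{n_1,n_2}(\theta)|$. The asymptotes are
$$\lim_{\theta \to \infty} Z_{n_1,n_2}(\theta) = -\frac{\bar{Y} - z_{p_2}S_2}{\sqrt{\left(1 + \dfrac{z_{p_2}^2}{2}\right)\dfrac{S_2^2}{n_2}}}, \qquad \lim_{\theta \to -\infty} Z_{n_1,n_2}(\theta) = \frac{\bar{Y} - z_{p_2}S_2}{\sqrt{\left(1 + \dfrac{z_{p_2}^2}{2}\right)\dfrac{S_2^2}{n_2}}}.$$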

22 Further, the derivative converges to 0 as $\theta \to \pm\infty$ and has the same sign as $\bar{X} - z_{p_1}S_1$ for sufficiently large negative $\theta$. The confidence region is
(1) an interval if
$$z_{\alpha/2} < \frac{|\bar{Y} - z_{p_2}S_2|}{\sqrt{\left(1 + \dfrac{z_{p_2}^2}{2}\right)\dfrac{S_2^2}{n_2}}} < |Z_{n_1,n_2}(\theta^*)|$$

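23 (2) the complement of an interval if
$$\frac{|\bar{Y} - z_{p_2}S_2|}{\sqrt{\left(1 + \dfrac{z_{p_2}^2}{2}\right)\dfrac{S_2^2}{n_2}}} \le z_{\alpha/2} < |Z_{n_1,n_2}(\theta^*)|$$
(3) the whole real line if
$$\frac{|\bar{Y} - z_{p_2}S_2|}{\sqrt{\left(1 + \dfrac{z_{p_2}^2}{2}\right)\dfrac{S_2^2}{n_2}}} < |Z_{n_1,n_2}(\theta^*)| \le z_{\alpha/2}$$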

24 We see that the region fails to be a bounded interval if the percentile in the denominator is not significantly different from zero, or,
$$|\bar{Y} - z_{p_2}S_2| \le z_{\alpha/2}\sqrt{\left(1 + \frac{z_{p_2}^2}{2}\right)\frac{S_2^2}{n_2}}$$
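The three cases can be explored numerically. The sketch below is an illustration under stated assumptions, not the authors' implementation. Writing $A = \bar{X} - z_{p_1}S_1$, $B = \bar{Y} - z_{p_2}S_2$, $a = (1 + z_{p_1}^2/2)S_1^2/n_1$ and $b = (1 + z_{p_2}^2/2)S_2^2/n_2$ (shorthand introduced here), squaring $|Z(\theta)| \le z_{\alpha/2}$ gives a quadratic inequality in $\theta$, and a short calculation from the definition of $Z$ gives $Z(\theta^*)^2 = A^2/a + B^2/b$. Both facts are derived for this example rather than taken from the slides.

```python
import math
from statistics import NormalDist


def z_region(xbar, s1, n1, ybar, s2, n2, p1, p2, alpha=0.05):
    """Classify {theta : |Z(theta)| <= z_{alpha/2}} and return its endpoints."""
    nd = NormalDist()
    zp1, zp2 = nd.inv_cdf(1.0 - p1), nd.inv_cdf(1.0 - p2)
    zc2 = nd.inv_cdf(1.0 - alpha / 2.0) ** 2

    A = xbar - zp1 * s1                        # numerator percentile estimate
    B = ybar - zp2 * s2                        # denominator percentile estimate
    a = (1.0 + zp1 ** 2 / 2.0) * s1 ** 2 / n1  # estimated variance of A
    b = (1.0 + zp2 ** 2 / 2.0) * s2 ** 2 / n2  # estimated variance of B

    # Largest value of Z(theta)^2 over theta is A^2/a + B^2/b.
    if A * A / a + B * B / b <= zc2:
        return "real line", None

    # (A - theta*B)^2 <= zc2 * (a + theta^2 * b)  <=>  qa*t^2 + qb*t + qc <= 0
    qa = B * B - zc2 * b
    qb = -2.0 * A * B
    qc = A * A - zc2 * a
    if qa == 0.0:
        raise ValueError("boundary case: region is a half-line")
    root = math.sqrt(qb * qb - 4.0 * qa * qc)
    lo, hi = sorted(((-qb - root) / (2.0 * qa), (-qb + root) / (2.0 * qa)))
    kind = "interval" if qa > 0 else "complement of interval"
    return kind, (lo, hi)


print(z_region(5.0, 2.0, 10, 10.0, 2.0, 10, 0.5, 0.5))  # bounded interval
print(z_region(5.0, 2.0, 10, 0.1, 2.0, 10, 0.5, 0.5))   # denominator near zero
```

For the "complement of interval" case the returned pair (lo, hi) marks the excluded interval: the region is everything at or below lo together with everything at or above hi.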

25 A Large Sample Nonparametric Approach
When sample sizes are large, we can also take a nonparametric approach. Let $\lceil w \rceil$ denote the smallest integer that is greater than or equal to w. We estimate the percentiles by order statistics:
$$\hat{\xi}_{1p_1} = X_{(r_1)} \text{ where } r_1 = \lceil n_1 p_1 \rceil, \qquad \hat{\xi}_{2p_2} = Y_{(r_2)} \text{ where } r_2 = \lceil n_2 p_2 \rceil.$$
Johnson and Huang (2004) consider a delta method approach and a modified approach based on the normal limit for
$$\sqrt{n_1 + n_2}\,\Big(\hat{\xi}_{1p_1} - \theta\,\hat{\xi}_{2p_2} - (\xi_{1,p_1} - \theta\,\xi_{2,p_2})\Big)$$
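A minimal sketch of the order-statistic estimate follows. The function name is hypothetical, and the product n * p is computed in floating point, so values that land exactly on an integer could in principle round either way.

```python
import math


def order_stat_percentile(sample, p):
    """Return (X_(r), r) with r = ceil(n * p), the order-statistic
    estimate of the 100p-th percentile for 0 < p < 1."""
    r = math.ceil(len(sample) * p)
    return sorted(sample)[r - 1], r


# With n = 107 and p = 0.05 the rank is ceil(5.35) = 6.
value, rank = order_stat_percentile(list(range(1, 108)), 0.05)
print(value, rank)
```

The ranks for n = 107 and n = 103 at p = 0.05 are both 6, which matches the use of $x_{(6)}$ and $y_{(6)}$ in the lumber example.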

26 Examples and Simulation
We consider data on the modulus of rupture (MOR) of Douglas Fir specimens, provided to us by J. Evans, whose percentile estimates were given in Aplin et al. (1986). The normal plots suggest that the data are normally distributed and it is reasonable to assume the variances are equal. The two data sets are

27 MOR (lb/in²) of 2 × 4 inches, Grade 2, Green (30% moisture content)

28 MOR (lb/in²) of 2 × 6 inches, Select Structural, Green (30% moisture content)

29 From the summary statistics

n_1 = 107    x̄ =     s_1² = 2,354,470
n_2 = 103    ȳ =     s_2² = 2,528,864
s_pooled =     x_(6) =     y_(6) =

we determine 95% confidence intervals for the ratio of the 5th percentile of the MOR of 2 × 4's to that of 2 × 6's.

Parametric        Nonparametric
(.395, .628)      (.400, .635)

30 Regions other than Intervals
Simulation with 5000 trials. n_1 = n_2 = 100
$\mu_1 = 1.93$, $\sigma_1 = 2$, $\xi_{1,0.05} = -1.36$
$\mu_2 = 1.93$, $\sigma_2 = 1$, $\xi_{2,0.05} = \mu_2 - 1.645\,\sigma_2 \approx 0.285$
where the denominator percentile is close to zero.

Estimated Coverage Probabilities (estimated standard deviations, in percent, in parentheses)

Large sample Z_{n1,n2}(θ)      Modified Z_{n1,n2}(θ)
87.58% (.4664)                 94.86% (.3123)

Large sample Nonparametric     Modified Nonparametric
88.82% (.4456)                 93.18% (.3565)
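The standard deviations in parentheses are consistent with the usual binomial standard error of an estimated proportion, $\sqrt{\hat{c}(1-\hat{c})/N}$ with N = 5000 trials. The short check below assumes that this is the formula used, since the slides do not state it.

```python
import math


def coverage_se(coverage, trials):
    """Binomial standard error of an estimated coverage probability."""
    return math.sqrt(coverage * (1.0 - coverage) / trials)


# Reported coverages from the simulation, expressed as proportions.
for c in (0.8758, 0.9486, 0.8882, 0.9318):
    print(f"{100 * c:.2f}%  se = {100 * coverage_se(c, 5000):.4f}%")
```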

31 Since the unmodified versions do not have good coverage, it is interesting to compare the behavior of the two modified procedures.

Modified Z_{n1,n2}(θ)
                    interval      outside interval   real line   overall
number              N_i = 2329    N_c = 2669         N_r = 2     N_t = 5000
coverage number
coverage percent    92.19%        97.19%             100%        94.86%
                    (0.3795%)     (0.2337%)          (0)         (0.3123%)

Modified Nonparametric
                    interval      outside interval   real line   overall
number              N_i = 1582    N_c = 3376         N_r = 42    N_t = 5000
coverage number
coverage percent    85.97%        96.48%             100%        93.18%
                    (0.4912%)     (0.2606%)          (0)         (0.3565%)

32 A BAYESIAN NONPARAMETRIC APPROACH
In the nonparametric setting, we select a Dirichlet process prior for the single sample problem. We begin by showing how to obtain a sample of several different percentiles from the posterior distribution. Examples of approximation to the joint posterior distribution of one and two percentiles are given. We also investigate the large sample behavior of the posterior quantile process for a single population.

33 To compare percentiles from two different populations, we select separate Dirichlet process prior distributions for each of the populations. An observation is selected from each of the posterior distributions and the appropriate function of the percentiles formed. By repeated sampling, we obtain an approximation to the posterior distribution of, for instance, the ratio of percentiles. An example is given. For any cumulative distribution function (cdf) F, the 100p-th percentile is
$$\xi_p = F^{-1}(p) = \inf\{x : F(x) \ge p\}, \quad 0 < p < 1.$$

34 The posterior distribution will be based on a random sample $X = (X_1, X_2, \ldots, X_n)$ from a distribution $G(\cdot)$. We consider a Dirichlet process, with prior measure $\alpha_0$, over the possible cdfs F. Let $t_1 < t_2 < \cdots < t_k$. Given the observations $X_1, X_2, \ldots, X_n$, we set
$$\alpha(t_i) = \alpha(-\infty, t_i] = \#(X_j \le t_i) + \alpha_0(t_i) \quad \text{for } 1 \le i \le k.$$
For simplicity, we set $\alpha(t_i, t_{i+1}) = \alpha(-\infty, t_{i+1}] - \alpha(-\infty, t_i]$ and denote $\alpha_R = \alpha_0(R)$.

35 By Ferguson (1973), the posterior distribution of F is a Dirichlet process with the measure $\alpha$. Then, for i < j, it is well known that
$$F(t_i) \sim \mathrm{Beta}\big(\alpha(t_i),\ n + \alpha_R - \alpha(t_i)\big)$$
$$F(t_j) - F(t_i) \sim \mathrm{Beta}\big(\alpha(t_i, t_j),\ n + \alpha_R - \alpha(t_i, t_j)\big)$$
$$\frac{F(t_j) - F(t_i)}{1 - F(t_i)} \,\Big|\, F(t_i) \sim \mathrm{Beta}\big(\alpha(t_i, t_j),\ n + \alpha_R - \alpha(t_i) - \alpha(t_i, t_j)\big).$$
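To make the first Beta marginal concrete, the sketch below draws values of $F(t)$ from its posterior at a fixed point t. It assumes, for illustration only, that the prior measure $\alpha_0$ is a total mass `prior_mass` spread uniformly over an interval (lo, hi); the helper names are hypothetical.

```python
import random


def posterior_alpha(t, data, prior_mass, lo, hi):
    """alpha(t) = #(X_j <= t) + alpha_0(-inf, t], with alpha_0 taken to be
    prior_mass times the Uniform(lo, hi) distribution (an assumption)."""
    count = sum(1 for x in data if x <= t)
    frac = min(max((t - lo) / (hi - lo), 0.0), 1.0)
    return count + prior_mass * frac


def draw_F(t, data, prior_mass, lo, hi, rng):
    """One posterior draw of F(t) ~ Beta(alpha(t), n + alpha_R - alpha(t)).
    Requires lo < t < hi so that both Beta parameters are positive."""
    a = posterior_alpha(t, data, prior_mass, lo, hi)
    b = len(data) + prior_mass - a
    return rng.betavariate(a, b)


rng = random.Random(0)
data = list(range(1, 101))                     # illustrative sample, n = 100
draws = [draw_F(50.5, data, 1.0, 0.0, 101.0, rng) for _ in range(2000)]
print(sum(draws) / len(draws))                 # posterior mean is alpha(t) / (n + alpha_R)
```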

36 Generation of Values of a Percentile from the Posterior Distribution
By Ferguson (1973), the posterior distribution of $F(t_1)$ is $\mathrm{Beta}(\alpha(t_1),\ n + \alpha_R - \alpha(t_1))$, so
$$\Pr[\xi_{p_1} \le t_1] = \Pr[p_1 \le F(t_1)].$$
Hence we generate $\xi_{p_1}$ as follows:
1 Generate a value $u_1$ from U(0, 1).
2 Numerically, find $t_1$ such that
$$u_1 = \frac{\Gamma(n + \alpha_R)}{\Gamma(\alpha(t_1))\,\Gamma(n + \alpha_R - \alpha(t_1))} \int_{p_1}^{1} z^{\alpha(t_1) - 1}(1 - z)^{n + \alpha_R - \alpha(t_1) - 1}\, dz$$
3 Take $t_1$ as $\xi_{p_1}$.
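The generation scheme can be sketched in Python with only the standard library. This is an illustration under stated assumptions rather than the authors' code: the prior measure is taken to be a mass `prior_mass` spread uniformly on (lo, hi) with all data inside that interval, the regularized incomplete beta function is evaluated by the standard continued-fraction expansion, and step 2 is solved by bisection. All function names are hypothetical.

```python
import math
import random


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta I_x(a, b) = Pr[Beta(a, b) <= x]."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def prob_quantile_le(t, p, data, prior_mass, lo, hi):
    """Pr[xi_p <= t | data] = Pr[F(t) >= p] under the Beta posterior of F(t)."""
    frac = min(max((t - lo) / (hi - lo), 0.0), 1.0)
    a = sum(1 for x in data if x <= t) + prior_mass * frac   # alpha(t)
    b = len(data) + prior_mass - a                           # n + alpha_R - alpha(t)
    if a <= 0.0:
        return 0.0
    if b <= 0.0:
        return 1.0
    return 1.0 - reg_inc_beta(a, b, p)


def draw_quantile(p, data, prior_mass, lo, hi, rng, tol=1e-9):
    """Steps 1 to 3: draw u ~ U(0, 1), then bisect for the smallest t
    with Pr[xi_p <= t] >= u."""
    u = rng.random()
    left, right = lo, hi
    while right - left > tol:
        mid = 0.5 * (left + right)
        if prob_quantile_le(mid, p, data, prior_mass, lo, hi) >= u:
            right = mid
        else:
            left = mid
    return right


rng = random.Random(1)
sample = [rng.gauss(4.0, 1.0) for _ in range(100)]
print([round(draw_quantile(0.05, sample, 1.0, -6.0, 14.0, rng), 3) for _ in range(5)])
```

Because $\alpha(t)$ jumps at each observation, the bisection frequently converges to a data value, so many draws coincide with order statistics.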

37 Illustration
Consider the ratio of 5th percentiles. As data, we generated a sample of size 100 from a N(4, 1) population representing the first population. Let $x_{(1)} \le x_{(2)} \le \cdots \le x_{(100)}$ be the order statistics. We also generated a second sample of size 100 from the same normal population to represent the sample from the second population.

38 Next, we select a value u from a uniform(0,1) distribution and then numerically solve $u = P_1[\xi_{1p_1} \le t]$ for t, which is then an observed value from the posterior distribution of the 5th percentile from the first population.

39 Most of the time, the generated value is one of the order statistics near the sample 5th percentile. Table 1 shows the first 1000 values selected from the posterior distribution. Notice that 98.1% are actual values in the data set. Table of first 1,000 generated values from the first posterior distribution when the data are 100 N(4, 1) observations. Out of 1,000 generations, 981 are the data values.

40 Table columns: Value of ξ_0.05, Freq, Property. Each generated value is either one of the order statistics x_(1) through x_(14) or an interpolated value; the numeric entries are not available.

41 Suppose we wish to explore the joint posterior distribution of two percentiles $(\xi_{p_1}, \xi_{p_2})$ with $p_1 < p_2$. To generate random values for $(\xi_{p_1}, \xi_{p_2})$ from the posterior distribution:
1 Generate a value $t_1$ using the $\mathrm{Beta}(\alpha(t_1),\ n + \alpha_R - \alpha(t_1))$ posterior of $F(t_1)$ and call it $\xi_{p_1}$. Implementation is as above.
2 Given the value $\xi_{p_1}$ generated in the previous step, we next generate a value using $\mathrm{Beta}(\alpha(t_1, t_2),\ n + \alpha_R - \alpha(t_1) - \alpha(t_1, t_2))$ and call it $\eta$. Take $\xi_{p_1} + \eta$ as $\xi_{p_2}$. Here $\eta = t_2 - t_1$.

42 To implement this procedure, we note that the conditional posterior probability
$$\Pr[\xi_{p_2} - \xi_{p_1} \le \eta \mid F(t_1) = p_1] = \Pr[\xi_{p_2} \le t_1 + \eta \mid F(t_1) = p_1] = \Pr[p_2 \le F(t_1 + \eta) \mid F(t_1) = p_1]$$
$$= \Pr\!\left[\frac{p_2 - p_1}{1 - p_1} \le \frac{F(t_1 + \eta) - p_1}{1 - p_1} \,\Big|\, F(t_1) = p_1\right] = \Pr\!\left[\frac{p_2 - p_1}{1 - p_1} \le \frac{F(t_1 + \eta) - F(t_1)}{1 - F(t_1)} \,\Big|\, F(t_1) = p_1\right]$$
and a value of $\xi_{p_2}$ given $\xi_{p_1} = F^{-1}(p_1) = t_1$ can be generated as follows:

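43 1 Generate $U_2 = u_2$ from U(0, 1).
2 Find $\eta$ such that
$$u_2 = \frac{\Gamma(n + \alpha_R - \alpha(t_1))}{\Gamma(\alpha(t_1, t_1 + \eta))\,\Gamma(n + \alpha_R - \alpha(t_1) - \alpha(t_1, t_1 + \eta))} \int_{(p_2 - p_1)/(1 - p_1)}^{1} z^{\alpha(t_1, t_1 + \eta) - 1}(1 - z)^{n + \alpha_R - \alpha(t_1) - \alpha(t_1, t_1 + \eta) - 1}\, dz$$
$$= \frac{\Gamma(n + \alpha_R - \alpha(t_1))}{\Gamma(\alpha(t_1, t_1 + \eta))\,\Gamma(n + \alpha_R - \alpha(t_1 + \eta))} \int_{(p_2 - p_1)/(1 - p_1)}^{1} z^{\alpha(t_1, t_1 + \eta) - 1}(1 - z)^{n + \alpha_R - \alpha(t_1 + \eta) - 1}\, dz$$
3 Take $t_1 + \eta$ as a value $t_2$ for $\xi_{p_2}$.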

44 We apply our algorithm to data on the modulus of rupture (MOR) of Douglas Fir specimens: observations (sorted) of MOR (lb/in²) of 2 × 4 inches, Grade 2, Green (30% moisture content).

45 The prior measure $\alpha_0$ used in our simulation study is the uniform on (0, 9500) and we consider $p_1 = 0.05$ and $p_2 = 0.1, 0.2, 0.5$ and 0.95.

Table: Basic statistics from 5,000 (each) vectors of percentiles generated from the joint posterior distribution.

               (ξ_0.05, ξ_0.10)    (ξ_0.05, ξ_0.20)
Mean vector    (2327.9, )          (2322.1, )
sd             (263.1, 225.3)      (265.8, 268.0)
Correlation

               (ξ_0.05, ξ_0.50)    (ξ_0.05, ξ_0.95)
Mean vector    (2317.5, )          ( )
sd             (267.7, 189.5)      (268.1, 248.3)
Correlation

46 Figure: Posterior distribution of 5-th percentile based on a sample of size 1000 standard normal variables. (Axes: x, cdf.)

47 Figure: Posterior distribution of 50-th percentile based on a sample of size 100 standard normal variables. Dots are approximation. (Vertical axis: cdf.)

48 Figure: Posterior distribution of 50-th percentile based on a sample of size 1000 standard normal variables. (Vertical axis: cdf.)

49 Figure: Estimated joint posterior of 5-th and 10-th percentiles. Based on 10,000 selections using Lumber data. (Vertical axis: scaled density.)

50 Figure: Estimated contours of joint posterior of 5-th and 10-th percentiles. Based on 10,000 selections using Lumber data.

51 Figure: Histogram of Q10 - Q5. Estimated posterior distribution of Q10 - Q5. Based on 10,000 selections using Lumber data. (Vertical axis: frequency.)

52 Figure: Estimated joint posterior of 25-th and 75-th percentiles. Based on 10,000 selections using Lumber data. (Vertical axis: scaled density.)

53 Figure: Estimated contours of joint posterior of 25-th and 75-th percentiles. Based on 10,000 selections using Lumber data.

54 Figure: Histogram of IQR. Estimated posterior distribution for interquartile range. Based on 10,000 selections using Lumber data. (Vertical axis: frequency.)

55 Figure: N(4,1)/N(4,1). Estimated posterior distribution for ratio of 5-th percentiles. Samples of size 100 N(4, 1). 10,000 generated pairs of percentiles. (Axes: ratio of 5th percentiles, frequency.)

56 Figure: N(2,1)/N(2,1). Estimated posterior distribution for ratio of 5-th percentiles. Samples of size 100 N(2, 1). 10,000 generated pairs of percentiles. (Axes: ratio of 5th percentiles, frequency.)

57 Posterior Distributions-Asymptotics We need the following property of a special density estimator. Lemma 1. Let X 1, X 2,..., X n,... be independently distributed as G and assume that G has a derivative g that is uniformly continuous on R. Let η p satisfy p = G(η p ) and assume that g(η p ) > 0.

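58 Let
$$G_n(t) = \frac{\#(X_j \le t) + \alpha_0(t)}{n + \alpha_R} \quad \text{and} \quad t_n = \inf\{t : p \le G_n(t)\}.$$
Then for $w \ne 0$, almost surely, as $n \to \infty$,
$$\hat{f}_w(t_n) = \frac{\#(t_n < X_j \le t_n + w/\sqrt{n})}{n \cdot w/\sqrt{n}} = g(\eta_p) + o(1).$$
Bertrand-Retali (1978); Silverman (1986), pp. 71-72.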

59 This holds for all (countable) rational w.
Remark. By Lemma 1 and the fact that $G_n(t_n) = p + O(n^{-1})$ as $n \to \infty$,
$$\frac{G_n(t_n + w/\sqrt{n}) - p}{w/\sqrt{n}} \to g(\eta_p)$$
almost surely, for all rational $w \ne 0$. For several percentiles $p_1 < p_2 < \cdots < p_k$, we define
$$t_{in} = \inf\{t : p_i \le G_n(t)\} \quad \text{for } i = 1, 2, \ldots, k.$$

60 Theorem Assume that the $X_1, X_2, \ldots$ are independently distributed as G, which has a derivative g that is uniformly continuous on R. Further assume that $p = (p_1, p_2, \ldots, p_k)$ has distinct ordered elements and $g(\eta_i) > 0$, for $i = 1, 2, \ldots, k$, where $p_i = G(\eta_i)$, for $i = 1, 2, \ldots, k$. Then, almost surely on the set where

61 Theorem continued
$$\frac{G_n(t_{in} + w_i/\sqrt{n}) - p_i}{w_i/\sqrt{n}} \to g(\eta_i) \quad \text{for } i = 1, 2, \ldots, k$$
as $n \to \infty$, we have
$$\Pr[\sqrt{n}(\xi_{p_1} - t_{1n}) \le w_1, \ldots, \sqrt{n}(\xi_{p_k} - t_{kn}) \le w_k] \to N_k(0, V_g)$$
where
$$(V_g)_{ij} = \begin{cases} p_i(1 - p_j)/[g(\eta_i)g(\eta_j)] & i \le j \\ p_j(1 - p_i)/[g(\eta_i)g(\eta_j)] & i > j \end{cases}$$
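For a concrete feel for the limiting covariance, the sketch below evaluates a matrix with entries $\min(p_i, p_j)(1 - \max(p_i, p_j))/[g(\eta_i)g(\eta_j)]$ when G is standard normal. The function name and the choice of G are assumptions for illustration, and the symmetric min/max form is the usual way of writing the joint quantile covariance.

```python
from statistics import NormalDist


def quantile_cov(ps, dist=NormalDist()):
    """Limiting covariance matrix of several quantiles of `dist`."""
    etas = [dist.inv_cdf(p) for p in ps]       # eta_i with p_i = G(eta_i)
    dens = [dist.pdf(e) for e in etas]         # g(eta_i)
    k = len(ps)
    return [[min(ps[i], ps[j]) * (1.0 - max(ps[i], ps[j])) / (dens[i] * dens[j])
             for j in range(k)]
            for i in range(k)]


# 5th percentile and median of a standard normal.
for row in quantile_cov([0.05, 0.5]):
    print([round(v, 4) for v in row])
```

For the median alone the entry reduces to $0.25 \cdot 2\pi = \pi/2$, the familiar asymptotic variance of the sample median of a standard normal.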

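62 Correction Terms to Normal Limit
Theorem Let $\Phi$ and $\varphi$ be the standard normal cdf and pdf, respectively, and write $G_n = G_n(t_n + w/\sqrt{n})$. Then, as $n \to \infty$,
$$\frac{\Gamma(n + \alpha_R)}{\Gamma((n + \alpha_R)G_n)\,\Gamma((n + \alpha_R)(1 - G_n))} \int_{p}^{1} z^{(n + \alpha_R)G_n - 1}(1 - z)^{(n + \alpha_R)(1 - G_n) - 1}\, dz$$
$$= 1 - \Phi(k_n) + \frac{1}{\sqrt{n + \alpha_R}}\,\frac{(1 - 2G_n)(k_n^2 - 1)}{3\sqrt{G_n(1 - G_n)}}\,\varphi(k_n) + \frac{d_n}{n + \alpha_R}\,\varphi(k_n) + O(n^{-3/2})$$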

63 where
$$k_n = \sqrt{\frac{n + \alpha_R}{G_n(1 - G_n)}}\,(p - G_n),$$
and $k_n$ and $d_n$ each converge to constants.

64 Figure: Posterior distribution of 5-th percentile based on a sample of size 1000 standard normal variables. Correction shown. (Axes: x, cdf.)

65 Figure: Posterior distribution of 50-th percentile based on a sample of size 100 standard normal variables. Correction shown. (Vertical axis: cdf.)

66 Figure: Posterior distribution of 50-th percentile based on a sample of size 1000 standard normal variables. Correction shown. (Vertical axis: cdf.)

67 We next consider the joint distribution of two quantiles.
Theorem Let $p_1 < p_2$ and $t_i = t_{in}$ be the solutions of $p_i = G_n(t_{in})$, for i = 1, 2, where
$$G_n(t) = \frac{\#(X_j \le t) + \alpha_0(t)}{n + \alpha_R}.$$
Write $G_{1n} = G_n(t_1 + w_1/\sqrt{n})$ and $G_{2n} = G_n(t_2 + w_2/\sqrt{n})$.

68 Then
$$\Pr[\sqrt{n}(\xi_{p_1} - t_1) \le w_1,\ \sqrt{n}(\xi_{p_2} - t_2) \le w_2] = \frac{1}{2\pi |V_n|^{1/2}} \int_{\sqrt{n + \alpha_R}\,(p_1 - G_{1n})}^{\infty} \int_{\sqrt{n + \alpha_R}\,(p_2 - G_{2n})}^{\infty} e^{-\frac{1}{2} y' V_n^{-1} y}\, dy_2\, dy_1 + \frac{1}{\sqrt{n + \alpha_R}}\,[\,\cdots\,] + O(n^{-1}),$$
where the bracketed correction is a linear combination of the terms $I_{10}$, $I_{01}$, $I_{30}$, $I_{21}$, $I_{12}$ and $I_{03}$ whose coefficients are rational functions of $z_{10}$ and $z_{20}$.
