Statistical Inference with Randomized Nomination Sampling


by Mohammad Nourmohammadi

A Thesis submitted to the Faculty of Graduate Studies of The University of Manitoba in partial fulfilment of the requirements of the degree of Doctor of Philosophy

Department of Statistics
University of Manitoba
Winnipeg

Copyright © 2014 by Mohammad Nourmohammadi

Abstract

In this dissertation, we develop several new inference procedures that are based on randomized nomination sampling (RNS). The first problem we consider is that of constructing distribution-free confidence intervals for quantiles of finite populations. The required algorithms for computing coverage probabilities of the proposed confidence intervals are presented. The second problem we address is that of constructing nonparametric confidence intervals for infinite populations. We describe the procedures for constructing confidence intervals and compare the constructed confidence intervals in the RNS setting, under both perfect and imperfect ranking scenarios, with their simple random sampling (SRS) counterparts. Recommendations for choosing the design parameters are made so as to achieve shorter confidence intervals than their SRS counterparts. The third problem we investigate is the construction of tolerance intervals using the RNS technique. We describe the procedures for constructing one- and two-sided RNS tolerance intervals and investigate the sample sizes required to achieve tolerance intervals which contain the determined proportions of the underlying population. We also investigate the efficiency of RNS-based tolerance intervals compared with the corresponding intervals based on SRS. A new method for estimating ranking error probabilities is proposed. The final problem we consider is that of parametric inference based on RNS. We introduce different data types associated with the different situations that one might encounter using the RNS design and provide the maximum likelihood (ML) and the method of moments (MM) estimators of the parameters in two classes of distributions: the proportional hazard rate (PHR) and proportional reverse hazard rate (PRHR) models.


To my wife Elaheh and my daughters Dorsa and Tara


Acknowledgement

I would like to express my sincere appreciation to my advisors, Dr. Mohammad Jafari Jozani and Dr. Brad C. Johnson, for their support, encouragement, consistently positive outlook and insightful feedback. I would also like to thank Dr. Mary Thompson, Dr. Mahmoud Torabi and Dr. Katherine Davies for serving on my dissertation committee and for their valuable comments and suggestions. I wish to thank the Department of Statistics at the University of Manitoba for providing me with a wonderful environment for my Ph.D. study. My special thanks go to my wife, Elaheh Heydari, and my daughters, Dorsa and Tara, for their enduring love and support.


Abbreviations

CDF          Cumulative Distribution Function
EM           Expectation-Maximization
ERSS         Extreme Ranked Set Sampling
IID          Independent and Identically Distributed
LS           Least Squares
MANS         Maxima Nomination Sampling
MINS         Minima Nomination Sampling
ML           Maximum Likelihood
MM           Method of Moments
MML          Modified Maximum Likelihood
PDF          Probability Distribution Function
PHR          Proportional Hazard Rate
PIHR         Proportional Inverse Hazard Rate
RNS          Randomized Nomination Sampling
RNS(ρ, ζ)    Randomized Nomination Sampling with parameters ρ and ζ
RSS          Ranked Set Sampling
SRS          Simple Random Sampling
SRSWOR       Simple Random Sampling Without Replacement
SRSWR        Simple Random Sampling With Replacement


Published and Submitted Papers From This Thesis

1. Nourmohammadi, M., Jafari Jozani, M. and Johnson, B. (2014). Confidence interval for quantiles in finite populations with randomized nomination sampling. Computational Statistics and Data Analysis, 73.
2. Nourmohammadi, M., Jafari Jozani, M. and Johnson, B. (2014). Nonparametric confidence intervals for quantiles with randomized nomination sampling. Sankhya, DOI /s
3. Nourmohammadi, M., Jafari Jozani, M. and Johnson, B. (2014). Distribution-free tolerance intervals with randomized nomination samples. Statistical Methodology, Revision Submitted.
4. Nourmohammadi, M., Jafari Jozani, M. and Johnson, B. (2014). Parametric randomized nomination sampling. In preparation.


Contents

List of Tables
List of Figures

1 Introduction
    On Rank-Based Sampling Techniques
    Obtaining a Randomized Nomination Sample
    Motivation
    Literature Review
    Thesis Outline

2 Confidence Intervals for Quantiles in Finite Populations
    Introduction
    RNS Replacement Protocols
        Level 0 RNS Design
        Level 1 RNS Design
        Level 2 RNS Design
    A Small Example
    Confidence Intervals Based on Level 0 RNS Design
    Optimal Choice of ζ and a Guideline for Choosing r
    Confidence Intervals Based on Level 1 RNS Design
    Confidence Intervals Based on Level 2 RNS Design
    Numerical Study
    Symmetric Confidence Intervals
    Equal-tail Confidence Intervals

3 Confidence Intervals for Quantiles in Infinite Populations
    Introduction
    Distributional Properties of RNS Samples
    RNS Confidence Intervals for Quantiles
    Large Sample Theory
    Choosing ζ for Symmetric Confidence Intervals
    On the Optimal Choice of ζ
    An Illustrative Example
    A Case Study

4 Distribution-Free Tolerance Interval
    Introduction
    RNS Tolerance Intervals
    Sample Size Determination
    RNS in Presence of Ranking Error
    Numerical Study
    A Case Study

5 Parametric Inference using RNS
    Introduction
    ML Estimation in RNS Complete-Data
    ML Estimation in the PHR Model
    ML Estimation in RNS Incomplete-Data
    RNS-Based MM Estimators
    Numerical Studies

6 Summary and Future Work

Bibliography


List of Tables

2.1 Probabilities of possible samples under three RNS protocols when $N = 4$, $m = 2$, and $P(K = 2) =$

2.2 Probabilities of all possible samples under three RNS protocols in Table 2.1 when $\zeta \in \{0, 0.5, 1\}$.

2.3 The values of $r$, lower and upper coverage probabilities associated with $[Y_{r+1:m}, Y_{m-r:m}]$ and $[Y_{r:m}, Y_{m-r+1:m}]$, respectively, and average length of 95% interpolated symmetric confidence intervals for the first quartile for three introduced population distributions of size $N = 100$, $m = 15$, $\zeta = 0$, where $K_i - 1$ is a Poisson(0.25) random variable that has been truncated so that $K_i \in \{1, 2, 3, 4\}$.

2.4 The values of $r$ and $s$, length and the observed coverage probabilities (in parentheses) of interpolated 95% equal-tail confidence intervals for the first and third quartiles associated with SRSWOR and Level 0 RNS design with $\zeta = 0$ and $\zeta = 1$, respectively. We considered three population shapes with population sizes $N \in \{40, 80, 100\}$, the sampling fraction $f = 20\%$ and $K_i - 1 \sim \text{Bernoulli}(0.75)$.

2.5 The length and the observed coverage probabilities (in parentheses) of interpolated 95% equal-tail and symmetric confidence intervals for the first ($Q_1$), second (Median) and third ($Q_3$) quartiles associated with Level 0 RNS design with $\zeta = 0.05$, $\zeta = 0.5$ and $\zeta = 0.95$, respectively. We considered the population shape to be LogNormal with population size $N = 100$, the sampling fraction $f = 20\%$ for two distributions of $K_i$: $K_i - 1 \sim \text{Bernoulli}(0.75)$ and $P(K_i = 5) =$

2.6 The length and the observed coverage probabilities (in parentheses) of interpolated 95% equal-tail and symmetric confidence intervals for the first ($Q_1$), second (Median) and third ($Q_3$) quartiles associated with Level 0 RNS design with $\zeta = 0.05$, $\zeta = 0.5$ and $\zeta = 0.95$, respectively. We considered the population shape to be Normal Mixture with population size $N = 100$, the sampling fraction $f = 20\%$ for two distributions of $K_i$: $K_i - 1 \sim \text{Bernoulli}(0.75)$ and $P(K_i = 5) =$

2.7 The length and the observed coverage probabilities (in parentheses) of interpolated 95% equal-tail and symmetric confidence intervals for the first ($Q_1$), second (Median) and third ($Q_3$) quartiles associated with Level 0 RNS design with $\zeta = 0.05$, $\zeta = 0.5$ and $\zeta = 0.95$, respectively. We considered the population shape to be Normal with population size $N = 100$, the sampling fraction $f = 20\%$ for two distributions of $K_i$: $K_i - 1 \sim \text{Bernoulli}(0.75)$ and $P(K_i = 5) =$

3.1 Minimum values of sample size $m$ in 95% SRS and RNS$(\rho_i, \zeta_i)$ confidence intervals for $Q_{X,p}$, $i = 1, 2, 3$, when $p = 0.005$, $0.01$, and $0.05$ with known $\rho_i$ and optimum values of $\zeta$ calculated as functions of $\rho_i$.

4.1 Minimum values of the sample size needed such that $Y_{1:m}(\zeta_r = 0)$ ($Y_{m:m}(\zeta_s = 1)$) are lower (upper) $(\pi, \gamma)$ SRS- and RNS-based tolerance limits for four distributions on $K$. In the imperfect ranking setting the ranking is done by two concomitant random variables with Kendall's $\tau = 0.4$ ($I_1$) and $\tau = 0.7$ ($I_2$), when $\pi = 0.6(0.1)0.9$, $0.95$ and $0.99$, $\gamma = 0.8$, $0.9$ and $0.95$, $B = 50000$, and the variable of study follows Normal(1000, ).

4.2 The range of $\zeta$ which improves RNS over SRS for three distributions on $K$. In the imperfect ranking case the ranking is done by a concomitant variable with Kendall's $\tau = 0.7$ ($I_2$), when $\gamma = 0.95$, $\pi = 0.6(0.1)0.9$ and $0.95$, $B = 50000$, and the variable of study follows Normal(1000, ).

4.3 Values of $r$ ($s$) for lower (upper) one-sided $(\pi, \gamma)$ tolerance limits of the form $[Y_{r:m}(\zeta_r = 0), \infty)$ ($(-\infty, Y_{s:m}(\zeta_s = 0)]$) for various ranking settings, three distributions on $K$, $\pi = 0.6(0.1)0.9$, $\gamma = 0.9$, $0.95$, and $m = 20$, $40$ and $70$. The imperfect ranking is conducted based on $B =$ when the population shape is LogNormal.

4.4 Values of $r$ ($s$) for lower (upper) one-sided $(\pi, \gamma)$ tolerance limits of the form $[Y_{r:m}(\zeta_r = 0), \infty)$ ($(-\infty, Y_{s:m}(\zeta_s = 0)]$) for various ranking settings, three distributions on $K$, $\pi = 0.6(0.1)0.9$, $\gamma = 0.9$, $0.95$, and $m = 20$, $40$ and $70$. The imperfect ranking is conducted based on $B =$ when the population shape is Mixture Normal.

4.5 Values of $r$ ($s$) for lower (upper) one-sided $(\pi, \gamma)$ tolerance limits of the form $[Y_{r:m}(\zeta_r = 0), \infty)$ ($(-\infty, Y_{s:m}(\zeta_s = 0)]$) for various ranking settings, three distributions on $K$, $\pi = 0.5(0.1)0.9$, $\gamma = 0.9$, $0.95$, and $m = 20$, $40$ and $70$. The imperfect ranking is conducted based on $B =$ when the population shape is Normal.

4.6 Optimal values of $\zeta$ to construct the RNS-based confidence interval for $Q_\pi$ for three distributions on $K$ and $\pi = 0.6(0.1)$

4.7 Values of $r$ ($s$) for lower (upper) one-sided $(\pi, \gamma)$ tolerance limits of the form $[Y_{r:m}(\zeta), \infty)$ ($(-\infty, Y_{s:m}(\zeta)]$) for four distributions on $K$, $\pi = 0.6(0.1)0.9$, $\gamma = 0.9$, $0.95$, $m = 20$, $40$ and $70$, and for $\zeta_{rs} = 0.5$, $\zeta_0 = 0.8$ and $\zeta = \zeta_\pi$ when the ranking setting is perfect.

4.8 Minimum values of $m$ such that $[Y_{1:m}(\zeta), Y_{m:m}(\zeta)]$ are two-sided $(\pi, \gamma)$ SRS- and RNS-based tolerance intervals for four distributions on $K$ when the ranking is perfect and $\zeta_{rs} = 0.5$ and $\zeta_0 =$

4.9 Interval length and values of $(r, s)$ for two-sided $(\pi, \gamma)$ tolerance intervals of the form $[Y_{r:m}(\zeta), Y_{s:m}(\zeta)]$ for four distributions on $K$, $\pi = 0.6(0.1)0.9$, $\gamma = 0.90$ and $0.95$, $m = 20$, $40$ and $70$, and $\zeta_{rs} = 0.5$, when the ranking is perfect and the population shape is LogNormal.

4.10 Interval length and values of $(r, s)$ for two-sided $(\pi, \gamma)$ tolerance intervals of the form $[Y_{r:m}(\zeta), Y_{s:m}(\zeta)]$ for three distributions on $K$, $\pi = 0.6(0.1)0.9$, $\gamma = 0.90$ and $0.95$, $m = 20$, $40$ and $70$, and $\zeta_{rs} = 0.5$, when the ranking is perfect and the population shape is Mixture Normal.

4.11 Interval length and values of $(r, s)$ for two-sided $(\pi, \gamma)$ tolerance intervals of the form $[Y_{r:m}(\zeta), Y_{s:m}(\zeta)]$ for three distributions on $K$, $\pi = 0.6(0.1)0.9$, $\gamma = 0.90$ and $0.95$, $m = 20$, $40$ and $70$, and $\zeta_{rs} = 0.5$, when the ranking is perfect and the population shape is Normal.

4.12 The range of $\zeta$ on which RNS$(\rho_2, \zeta)$ needs a smaller sample size than SRS for $\rho_3 = (0.2, 0, 0, 0.8)$, perfect ($P$) and imperfect ranking settings when the auxiliary variable is weight ($W$) and length ($L$), $\pi \in \{0.8, 0.9\}$ and $\gamma =$

4.13 Values of $r$ ($s$) for lower (upper) $(\pi, \gamma)$ tolerance limits of the form $[Y_{r:m}(\zeta_r = 0), \infty)$ ($(-\infty, Y_{s:m}(\zeta_s = 1)]$) for perfect ($P$) and imperfect ranking settings when the auxiliary variable is weight ($W$) and length ($L$), $\pi \in \{0.8, 0.9\}$, $\gamma = 0.95$ and $m =$

4.14 Values of $(r, s)$ for two-sided $(\pi, \gamma)$ tolerance intervals of the form $[Y_{r:m}(\zeta_{rs} = 0.5), Y_{s:m}(\zeta_{rs} = 0.5)]$ for perfect ($P$) and imperfect ranking settings when the auxiliary variable is weight ($W$) and length ($L$), $\pi \in \{0.8, 0.9\}$, $\gamma = 0.95$ and $m =$

5.1 The value of $R_i$ introduced in Theorem 5.2 for RNS complete-data and Type I, Type II and Type III incomplete-data.


List of Figures

2.1 Computed $\rho(\text{SRS}, \text{RNS})$ from Level 0 (L0, solid line), Level 1 (L1, dashed line) and Level 2 (L2, dotted line) RNS protocols and SRSWOR 95% interpolated symmetric confidence intervals for the first ($Q_1$), second (median) and third ($Q_3$) population quartiles, three hypothetical population shapes, $\zeta \in [0, 1]$, $N = 100$ and $m = 15$ when $K_i - 1$ is a Bernoulli(0.75) random variable. The first row shows the shapes of the underlying distributions.

2.2 Computed $\rho(\text{SRS}, \text{RNS})$ from Level 0 (L0, solid line), Level 1 (L1, dashed line) and Level 2 (L2, dotted line) RNS protocols and SRSWOR 95% interpolated equal-tail confidence intervals for the first ($Q_1$), second (median) and third ($Q_3$) population quartiles, three hypothetical population shapes, $\zeta \in [0, 1]$, $N = 100$ and $m = 15$ when $K_i - 1$ is a Bernoulli(0.75) random variable. The first row shows the shapes of the underlying distributions.

3.1 The area of $\zeta$ which improves RNS$(\rho, \zeta)$-based symmetric confidence intervals over their SRS counterparts for $p \in (0, 1)$.

3.2 Comparison of $p$ and $\rho(p, \zeta)$ and comparison of $m - 2r + 1$ in SRS, RNS$(\rho_1, \zeta_1)$, RNS$(\rho_2, \zeta_2)$, and RNS$(\rho_3, \zeta_3)$, where $\zeta_i$ is the optimum value of $\zeta$ as a function of $\rho_i$, $i = 1, 2, 3$, $m = 45$ and $r$ is chosen as the largest value for which the coverage probability exceeds the nominal 0.95 level.

3.3 Comparison of the coverage probabilities of exact confidence intervals for $p \in [0, 1]$ in SRS, RNS$(\rho_1, \zeta_1)$, RNS$(\rho_2, \zeta_2)$, and RNS$(\rho_3, \zeta_3)$, where $\zeta_i$ is the optimal value of $\zeta$ as a function of $\rho_i$, $i = 1, 2, 3$. The nominal confidence level is

3.4 Comparison of the coverage probabilities of confidence intervals for $p = 0.005$, $0.01$, and $0.05$ in SRS, RNS$(\rho_1, \zeta_1)$, RNS$(\rho_2, \zeta_2)$, and RNS$(\rho_3, \zeta_3)$, where $\zeta_i$ is the optimal value of $\zeta$ calculated as a function of $\rho_i$, $i = 1, 2, 3$. The nominal confidence level is

3.5 Comparison of the average coverage probability of the confidence interval $[Y_{1:m}(\zeta_i), Y_{m:m}(\zeta_i)]$ in SRS, RNS$(\rho_1, \zeta_1)$, RNS$(\rho_2, \zeta_2)$, and RNS$(\rho_3, \zeta_3)$, where $\zeta_i$ is the optimal value of $\zeta$ calculated as a function of $p$ and $\rho_i$, $i = 1, 2, 3$.

3.6 Average confidence interval lengths for the SRS and RNS$(\rho_i, \zeta_{i,0.75})$ designs (first row) and the SRS and RNS$(\rho_i, \zeta^{o}_{i,0.75})$ designs (second row), $i = 1, 2, 3$, under perfect ranking and imperfect ranking 1 and 2 scenarios.

3.7 The difference between the simulated and expected coverage probabilities for the SRS and RNS$(\rho_i, \zeta_{i,0.75})$ designs (first row) and the SRS and RNS$(\rho_i, \zeta^{o}_{i,0.75})$ designs (second row), $i = 1, 2, 3$, under perfect ranking and imperfect ranking 1 and 2 scenarios.

3.8 Mercury levels, weight and length of Sander vitreus in the dataset and associated scatterplots.

5.1 The relative precision of the ML estimators of $\gamma(\theta)$ based on the RNS complete-data over their SRS counterparts in the PHR (left panel) and PRHR (right panel) models when $K = 2, 3, 4, 5$ and the proportion of maximums is $p = 0(0.1)$

5.2 The relative precision of the RNS incomplete-data Type I (top left panel) for $\zeta \in [0, 1]$ and $\lambda \in \{1, 2, 3, 4\}$, Type II (top right panel) for four distributions on $K$ and $\lambda = 1(0.1)5$, and Type III (middle and bottom panels) for $\zeta \in \{0, 0.25, 0.75, 1\}$ and $\lambda = 1(0.1)5$ in an exponential distribution with parameter $\lambda$ and $m =$

5.3 The relative precision of the RNS incomplete-data Type I (top left panel) for $\zeta \in [0, 1]$ and $\eta \in \{1, 2, 3, 4\}$, Type II (top right panel) for four distributions on $K$ and $\eta = 1(0.1)5$, and Type III (middle and bottom panels) for $\zeta \in \{0, 0.25, 0.75, 1\}$ and $\eta = 1(0.1)5$ in an exponential distribution with parameter $\eta$ and $m =$

5.4 The relative precision of the MM estimators of $\gamma(\theta)$ based on the RNS complete-data over their SRS counterparts in the PHR (left panel) and PRHR (right panel) models when $k_i = 2, 3, 4$, and $5$, and the proportion of maximums is $p = 0.1(0.1)$

5.5 The relative precision of the RNS incomplete-data Type I (top left panel) for $\zeta \in [0, 1]$ and $\lambda \in \{1, 2, 3, 4\}$, Type II (top right panel) for four distributions on $K$ and $\eta = 1(0.1)5$, and Type III (middle and bottom panels) for $\zeta \in \{0, 0.25, 0.75, 1\}$ and $\eta = 1(0.1)5$ in a beta distribution with parameter $1/\eta$, the shape parameter $\beta = 1$, and $m =$

5.6 The relative precision of the RNS incomplete-data Type I (top left panel) for $\zeta \in [0, 1]$ and $\eta \in \{1, 2, 3, 4\}$, Type II (top right panel) for four distributions on $K$ and $\eta = 1(0.1)5$, and Type III (middle and bottom panels) for $\zeta \in \{0, 0.25, 0.75, 1\}$ and $\eta = 1(0.1)5$ in a beta distribution with parameter $1/\eta$, the shape parameter $\beta = 1$, and $m =$

Chapter 1

Introduction

1.1 On Rank-Based Sampling Techniques

One of the important practical problems in statistics is to obtain more representative samples from the underlying population, so as to give an informative picture of the variable of interest. Providing more structured samples than those obtained in a simple random sampling (SRS) design is one way to address this issue. Rank-based sampling schemes utilize easily obtained rank-based information related to the variable of interest in the population to provide an artificially stratified sample and direct our attention toward the more representative units in the population.

Ranked set sampling (RSS) is a rank-based approach to data collection first proposed by McIntyre (1952) for situations where taking actual measurements of the attribute of interest on the sampling units is time-consuming, destructive, or costly, but a small set of sample units can be ordered with respect to the variable of interest, without actually measuring them, relatively easily and at negligible cost. In RSS, a set of $k$ items is drawn from the population, the items of the set are ranked by judgement, and only the item judged the smallest is quantified. Then another set of size $k$ is drawn and ranked, and only the item judged the second smallest is quantified. The procedure continues until the item judged the largest in the $k$-th set is quantified. This completes one cycle of the sampling. The cycle is then repeated as many times as desired. This form of RSS is referred to as balanced in the sense that each order statistic in the ranked sets is quantified the same number of times. In the unbalanced RSS scheme, a cycle of $k$ sets of size $k$ is drawn and ranked, and the order statistics with orders $r_1, \ldots, r_k$ are quantified in these $k$ ranked sets, where each $r_j$ can be any integer from 1 to $k$ and the $r_j$ are not necessarily different. For more information on RSS see Chen et al. (2004) and Wolfe et al. (2004).

Randomized nomination sampling (RNS) is also a rank-based sampling technique. It was introduced by Jafari Jozani and Johnson (2012) for estimating the mean value of the characteristic of interest in finite populations. This thesis concerns parametric and nonparametric inference based on the RNS design. To this end, in this chapter we discuss the RNS design and some of its applications. We show that several commonly used rank-based sampling techniques can be written as special cases of the RNS design.

1.2 Obtaining a Randomized Nomination Sample

Let $\{K_i : i \in \mathbb{N}\}$ be a sequence of independent random variables taking values in $\{1, \ldots, M\}$ with probabilities $\rho = \{(\rho_1, \rho_2, \ldots, \rho_M) : \sum_{i=1}^{M} \rho_i = 1\}$. When the underlying population is finite with size $N$, we have $1 \le M \le N$. The value of $M$ is defined by the context, as will be seen later. Further, let $\{Z_i : i \in \mathbb{N}\}$ be a sequence of independent Bernoulli random variables with success probability $\zeta \in [0, 1]$, independent of the $K_i$. First, an initial simple random sample of $K_1$ units is selected from the population; this sample is called a set. Then, the sampled units in the set are ordered based on the attribute of interest via some ranking process so that the smallest and the largest units are identified. This judgement ranking can be accomplished using a variety of mechanisms, including expert opinion, visual comparisons, or the use of an easy-to-obtain auxiliary variable. Note that the resulting ranking can be perfect or imperfect. Once the judgement ranking of the $K_1$ units in the set is complete, the first unit in the sample is selected based on the value of $Z_1$: we select the smallest unit in the set if $Z_1 = 0$ and the largest if $Z_1 = 1$. Then the attribute of interest is measured on the selected unit. This process is repeated $m$ times to obtain an RNS sample of size $m$. We refer to this design as an RNS design with parameters $\rho$ and $\zeta$, or RNS$(\rho, \zeta)$ design for short.

Suppose $X$ is the variable of interest. Within the $i$-th set, the selected nominee for the final measurement can be written as

$$Y_i := Y(X, K_i, Z_i) = (1 - Z_i) X_{1:K_i} + Z_i X_{K_i:K_i},$$

where $X$, $K_i$ and $Z_i$ are independent and $X_{r:k}$ refers to the $r$-th order statistic in a sample of size $k$. Throughout the thesis, we use $X_{r:k}$ and $X_{[r:k]}$ to distinguish between the $r$-th order statistics in a sample of size $k$ under perfect and imperfect ranking scenarios, respectively. RNS-based statistical inference may be made under the assumption that the value $y_i$ is observed and that $k_i$ and $z_i$, $i = 1, \ldots, m$, are unknown. However, there might be situations in which the size of the sets, the number of maximums (and consequently the number of minimums), or both are chosen in advance instead of being generated by a randomized process. The cumulative distribution function (CDF) of $Y_i$ can be found by conditioning on $K_i = k_i$, on $Z_i = z_i$, or on both. The conditioning argument makes the theoretical investigation more complicated, but it provides more efficient statistical inference. In this thesis, both unconditional and conditional RNS are presented.
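The sampling procedure described above can be sketched in Python as follows. This is only a minimal illustration, assuming perfect ranking and an infinite population, so that every set is a fresh IID draw. The names `rns_sample` and `draw` and the use of NumPy are choices made for this sketch and do not come from the thesis.

```python
import numpy as np

def rns_sample(draw, m, rho, zeta, rng):
    """Draw an RNS(rho, zeta) sample of size m under perfect ranking.

    draw(k, rng) must return an array of k IID values of X (hypothetical helper).
    rho[j] is P(K_i = j + 1); zeta is P(Z_i = 1), i.e. the probability
    of nominating the maximum of a set rather than its minimum.
    Returns the nominees y together with the realized k_i and z_i.
    """
    rho = np.asarray(rho, dtype=float)
    # Random set sizes K_i on {1, ..., M} with probabilities rho.
    set_sizes = rng.choice(np.arange(1, len(rho) + 1), size=m, p=rho)
    # Bernoulli(zeta) indicators Z_i, independent of the set sizes.
    take_max = rng.random(m) < zeta
    y = np.empty(m)
    for i in range(m):
        x = draw(set_sizes[i], rng)                 # the i-th set
        y[i] = x.max() if take_max[i] else x.min()  # Y_i = (1 - Z_i) X_{1:K_i} + Z_i X_{K_i:K_i}
    return y, set_sizes, take_max.astype(int)

# Example: standard normal X, set sizes 1, 2, 3 with probabilities 0.2, 0.3, 0.5.
rng = np.random.default_rng(2014)
y, k, z = rns_sample(lambda n, g: g.standard_normal(n), 10, [0.2, 0.3, 0.5], 0.7, rng)
```

The function returns $k_i$ and $z_i$ alongside $y_i$ so that either the unconditional setting (only $y_i$ used) or a conditional setting can be mimicked by the caller.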
Some well-known examples of randomized nomination sampling are given below:

(1) The choice of $K_i = 1$, $i = 1, \ldots, m$, results in the SRS design. Throughout this thesis the simple random sample is denoted by $X_i$, $i = 1, \ldots, m$.
(2) The choice of $\zeta = 1$ nominates the maximum from each set and results in a maxima nomination sampling (MANS) design (see Willemain, 1980, and Jafari Jozani and Mirkamali, 2010).
(3) The choice of $\zeta = 0$ nominates the minimum from each set and results in a minima nomination sampling (MINS) design (see Tiwari and Wells, 1989).
(4) The choice of $\zeta = 1/2$ and $K_i = k$, for a constant $k \in \mathbb{N}$, results in a randomized extreme ranked set sampling (RERSS) design (see Jafari Jozani and Johnson, 2012).

With slight modifications the results can be used in the following designs:

(5) The choice of $Z_{2i-1} = 1$ and $Z_{2i} = 0$, and $K_i = k$, where $i = 1, \ldots, m$, for a constant $k \in \mathbb{N}$ and an even number $m$, results in an extreme ranked set sampling (ERSS) design (see Samawi et al., 1996, and Ghosh and Tiwari, 2009).
(6) The choice of $Z_{2i-1} = 1$ and $Z_{2i} = 0$, and $K_i = i$, where $i = 1, \ldots, m$, for an even number $m$, results in a moving extreme ranked set sampling (MRSS) design (see Al-Odat and Al-Saleh, 2001).

Note that the $Z_i$, $i = 1, \ldots, m$, in (5) and (6) are no longer IID.

1.3 Motivation

Willemain (1980) introduced the MANS design with a fixed set size $k$ (i.e., $\zeta = 1$ with $\rho_k = 1$) to estimate the median of a population in the situation where it is practically impossible to draw a simple random sample. In his application, a federal agency was interested in relating the reimbursement amounts paid for nursing home services more closely to the actual cost of the services provided. Since complete enumeration of costs for all patients in every nursing home would be prohibitively expensive, sampling methods were considered. However, it was felt that the nursing home operators would be more willing to accept sample-based rates if they were given a role in selecting the cases chosen for assessment. The operators were therefore allowed to make the selection of those residents whose care they felt to be most expensive more likely. Willemain (1980) described a procedure in which one selects $m$ independent sets of size $k$ and, for each set, allows the nursing home operator(s) to select the patient they judge to have the most expensive care requirements. Since Willemain (1980), maxima nomination sampling has been the topic of numerous papers; more details are presented in Section 1.4.

Jafari Jozani and Johnson (2012) showed that the MANS design is not necessarily the optimal design choice for estimating the population total. They suggested using the observations from both extremes, maximums and minimums, and giving more flexibility to the design by letting the set size be random. The method proposed by Jafari Jozani and Johnson (2012) also provides a unified way of analyzing MANS and MINS designs.

The efficiency of statistical inference using rank-based sampling techniques relative to their SRS counterpart, with the same number of quantified units, depends upon the underlying population, the ranking accuracy, and the set sizes. One concern in working with extremes arises when the distribution of the variable of interest is skewed. RNS has the flexibility to adjust the proportion of maximums and minimums by choosing an appropriate $\zeta \in [0, 1]$ based on the shape of the underlying distribution. Regarding the ranking accuracy, unlike regular RSS, RNS allows for an increase of the set size without introducing too much error. Identifying the extremes, rather than producing a complete ranking, is more practical, since we need to successfully identify only the first and/or the last ordered unit. In terms of the set size, while the RNS design does not preclude the use of fixed set sizes (by taking $P(K_i = k) = 1$ for $i = 1, \ldots, m$ and some fixed $k$), allowing for random set sizes provides additional flexibility in the design.
There are circumstances in which the set size might naturally be random. For example, reliability experiments quite often give rise to an RNS design, and statistical inference using RNS can be easily applied in a system reliability context to make inference on the lifetime of a system with several components in parallel or in series. If $K$ is the number of components in a parallel (series) system and $X$ is the system lifetime, then repeated observations on the pair $(X, K)$ constitute a maxima (minima) nomination sample. In this context, one notes that $K$ might be random or non-random, depending on the application. A user wishing to control the level of redundancy in a system might use $K$ as a design parameter. Life-testing experiments carried out by a manufacturer might involve a variety of $K$ values chosen according to some fixed scheme. On the other hand, consumers often have the option of choosing the value of $K$, perhaps implicitly. Thus a random sample of consumers would result in a randomized nomination sample with random $K$. Gemayel et al. (2010) provide more discussion on random set sizes in RSS.

Another advantage of allowing random set sizes is that, when $\rho_1 > 0$, we have, on average, $m\rho_1$ observations which comprise a simple random sample. Indeed, on average, RNS samples will contain $m\zeta\rho_k$ maximums from sets of size $k$ and $m(1 - \zeta)\rho_k$ minimums from sets of size $k$, for $k = 1, \ldots, M$. Thus, in addition to the simple random sample portion of the RNS sample, we also have a collection of extremal order statistics from various set sizes, which can contain much more information about the population than SRS observations. The RNS scheme can therefore accommodate random set sizes when the application calls for them.

The studies on the RNS design presented in this thesis were motivated by the work of Jafari Jozani and Johnson (2012). We investigate the performance of the RNS design in making parametric and nonparametric statistical inference in both finite and infinite populations.

1.4 Literature Review

After Willemain (1980) introduced MANS, Boyles and Samaniego (1986) examined the problem of nonparametric estimation of the population CDF under MANS, relaxing the assumption that the $K_i$ are fixed. In this case, the units obtained from the MANS design are generated from the same parent distribution, but they are not identically distributed unless the $m$ sets from which they were obtained are of the same size. Considering a random set size, Boyles and Samaniego (1986) derived the nonparametric ML estimator of the population CDF and established the strong consistency of their estimator. They also identified the asymptotic distribution of their estimator. They applied their results to a real data set from the Natural Environment Research Council of Great Britain in 1975 consisting of the number of floods per year of the Nidd River in England and the discharge, in cubic meters per second, corresponding to the maximum flood. They also demonstrated the strong consistency of Willemain's estimate of the median as a byproduct of their development. Tiwari (1988) derived the Bayes estimator of the population CDF ($F$) in the MANS design. Considering the Dirichlet distribution as the prior distribution, he showed that under some conditions the Bayes estimator is identical to the nonparametric ML estimator derived by Boyles and Samaniego (1986). Later, Tiwari and Wells (1989) presented a quantile estimator using maxima nomination samples based on the ML estimator derived by Boyles and Samaniego (1986). Wells et al. (1990) considered another important case, in which the nominee is the minimum of the sets, and called it MINS. Kvam and Samaniego (1993) proposed an alternative nonparametric estimator of the population CDF using the least squares approach in MANS. This approach was also used by the same authors for estimating the CDF using RSS.
Kvam and Samaniego (1993) concluded that if the cost of obtaining maximums is no more than the cost of obtaining regular sample units, then the least squares (LS) estimator of the CDF within the range of the maxima nomination sample is preferred over the empirical distribution function (EDF) of an independent and identically distributed (IID) sample when the primary interest is the upper portion of the unknown distribution. Their simulation showed that, with minor adjustments, the LS estimator performs equally well using the observed minima rather than maxima.

Jafari Jozani and Mirkamali (2010) provided an application of MANS in constructing better acceptance sampling plans for attributes of interest than the usual ones based on SRS. Acceptance sampling uses statistical sampling to determine whether to accept or reject a production lot of material. It has been a common quality control technique in industry. An important area in quality control research is control charting for attributes to monitor the number or proportion of nonconforming items, i.e., the items which do not satisfy some conditions in the process. Jafari Jozani and Mirkamali (2011) showed that using the MANS technique in constructing attribute control charts leads to a substantial improvement over the usual control charts for attributes based on SRS.

RNS was first introduced by Jafari Jozani and Johnson (2012), who provided design-based estimation procedures for use in finite population sampling and showed some optimality results over SRS and strict maxima (minima) nomination sampling schemes. They provided the first- and second-order inclusion probabilities of the population elements for both the with- and without-replacement variations. Minimizing the variance of the Hansen-Hurwitz estimator, they stated that, under the perfect ranking assumption, there is always an RNS design with $\zeta > 1/2$ that outperforms every RNS design with $\zeta \le 1/2$. In particular, MINS and a randomized version of ERSS with $\zeta = 1/2$ cannot be optimal in the finite population setting, and MANS always performs better than MINS. They showed that the RNS design can perform better than the SRS design in estimating the population total for many population types.
Moreover, they provided the relative efficiencies of the RNS with-replacement and without-replacement designs with respect to an SRS without-replacement design for the optimal choice of design parameters under both perfect and imperfect ranking scenarios.

1.5 Thesis Outline

The outline of the thesis is as follows. In Chapter 2, we discuss three different ways of constructing an RNS design in finite populations using various replacement policies. We also discuss the construction of confidence intervals for population quantiles. Several theoretical results are presented in this chapter. We also provide guidelines for choosing the design parameter $\zeta$ that lead to shorter confidence intervals for specific population quantiles compared with their SRS counterparts. We develop recursive algorithms that can be used to obtain the confidence coefficient associated with the confidence intervals. Numerical studies are conducted to evaluate the performance of the RNS-based symmetric and equal-tail confidence intervals compared with their counterparts based on the SRS design. Chapter 2 also discusses how a fixed set size, and conditioning on whether the selected unit is the maximum or the minimum, affect the length of the constructed confidence intervals.

Chapter 3 describes the RNS design and presents some basic distribution theory of the random variables associated with this design when the underlying population is absolutely continuous. We study the construction of the exact and asymptotic RNS confidence intervals for quantiles. We show a duality between the problem of constructing an RNS confidence interval for the $p$-th quantile of the underlying population and that of constructing an SRS confidence interval for the $p'$-th quantile of the population, where $p, p' \in [0, 1]$ and $p'$ is obtained as a function of $p$ and the parameters of the RNS design. We compare the RNS and SRS confidence intervals based on their length and coverage probabilities and the necessary sample size to achieve the desired coverage probabilities in each design.
It is observed that the design parameter $\zeta$ associated with RNS provides a flexible tool which enables one to construct confidence intervals with the exact desired coverage probabilities for a wide range of population quantiles without the use of randomized procedures. A case study based on a small livestock data set for both perfect and imperfect ranking situations is also provided.

In Chapter 4, distribution-free RNS-based tolerance intervals are studied and some of the necessary theoretical results are derived. We compare the proposed RNS tolerance intervals with those based on SRS in terms of the corresponding coverage probabilities and the sample size necessary for their existence. The efficiency of the constructed RNS-based tolerance intervals compared to the SRS counterparts is discussed. We investigate the performance of RNS-based tolerance intervals for different values of the design parameters and various population shapes. We find the values of the design parameters that make the RNS-based tolerance intervals superior to their SRS-based counterparts in terms of the sample size needed to construct the tolerance intervals. The RNS design in the presence of ranking error is also discussed and a new method for estimating ranking error probabilities is proposed.

Chapter 5 examines the use of the RNS design in the estimation of the distribution parameters $\theta$ in proportional hazard rate (PHR) models and proportional reverse hazard rate (PRHR) models. We consider both the method of moments (MM) and ML estimation of $\theta$. Depending on which of the values $y_i$, $k_i$, and $z_i$ are known, four types of RNS data are defined. The RNS complete data, in which the triplet $(y_i, k_i, z_i)$ is known for the selected units, is investigated in detail, and the expected values and the variances of the estimators are obtained. It is shown that there is always a value of the design parameter $\zeta$ for which the RNS design is more efficient than the SRS design. The EM algorithm is used to compare the RNS-based estimators with their SRS counterparts when the data are not complete.

In Chapter 6, we conclude with a summary of what was accomplished in Chapters 2 to 5. We also describe some future work that could be carried out in the areas related to those discussed in this dissertation.

Chapter 2 Confidence Intervals for Quantiles in Finite Populations

2.1 Introduction

There are many cases in which one may be interested in the quantiles of a distribution. For example, inference on population quantiles may be of more interest than inference on the population mean when the underlying distribution is highly skewed, as quantiles are less influenced by extreme observations. In this chapter, we study the problem of constructing confidence intervals for population quantiles under different RNS designs for finite populations. In recent years, many researchers have considered similar problems for finite and infinite populations under different rank-based sampling designs. For example, under the RSS design, Ozturk and Deshpande (2006) proposed distribution-free confidence intervals for quantiles of infinite populations, and showed that RSS-based intervals tend to be shorter than their counterparts based on SRS. Later, Deshpande et al. (2006) developed nonparametric RSS-based confidence intervals for quantiles of finite populations. For recent developments in this direction, see Frey (2007a), Ozturk (2012) and the references therein. In other applications, the construction of confidence intervals for quantiles has been the topic of the papers by Philip and Lam (1997) for assessing the status of hazardous waste sites, Kvam (2003) for monitoring water quality, Murff and Sager (2006) for evaluating mercury contamination in fish, and Burgette and Reiter (2012) for studying the adverse effects of mothers' tobacco smoke exposure during pregnancy on health indices of infants.

In the finite population setting, the construction of a randomized nomination sample can be done in different ways. It is usual to assume that sets are drawn without replacement from the underlying population. However, different replacement policies for the measured and ranked units in a set, prior to the selection of the units in the next set, result in different RNS designs. Following Deshpande et al. (2006), we consider three without-replacement RNS techniques, denoted the Level 0, Level 1, and Level 2 RNS designs, respectively. In the Level 0 RNS design, sets are drawn without replacement, but all units in the set, including the measured unit, are returned to the population prior to selection of the next set. In the Level 1 RNS design, all units in the set, except the unit selected for full measurement, are returned to the population. If none of the units from the sets are returned to the population before drawing the next set, then we call this the Level 2 RNS design. Jafari Jozani and Johnson (2012) developed recursive algorithms to obtain the first- and second-order inclusion probabilities for population units under the Level 0 and Level 1 RNS designs.

In Section 1.3, we briefly discussed the advantages of random set sizes in RNS. In addition, when $\rho_1 > 0$ is moderately large, as proposed in Chapter 4, after observing the RNS sample we can bootstrap its SRS portion to estimate the ranking error probabilities in an imperfect RNS design.
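The three replacement policies described above can be illustrated with a small simulation. The following Python sketch is not taken from the thesis: the function name `draw_rns_sample`, its arguments, and the use of Python's `random` module are assumptions made for illustration. It takes `rho` as the set-size distribution (`rho[j]` is the probability of a set of size `j + 1`) and `zeta` as the probability of nominating the maximum, and it ranks each set on the auxiliary values `u` only.

```python
import random


def draw_rns_sample(x, u, m, rho, zeta, level, rng=None):
    """Draw one randomized nomination sample of size m (illustrative sketch).

    x     : study-variable values of the population units
    u     : auxiliary values used only for ranking within a set
    rho   : set-size probabilities, rho[j] = P(K = j + 1)   (assumed encoding)
    zeta  : probability that the maximum, rather than the minimum, is measured
    level : 0, 1 or 2, the replacement policy
    Returns a list of triplets (y_i, k_i, z_i).
    """
    rng = rng or random.Random()
    available = list(range(len(x)))          # indices of units still in the population
    sample = []
    for _ in range(m):
        # Set size K_i drawn from the distribution rho.
        k = rng.choices(range(1, len(rho) + 1), weights=rho)[0]
        if k > len(available):
            raise ValueError("set size exceeds the remaining population")
        # Simple random set without replacement, ranked on the auxiliary variable.
        units = rng.sample(available, k)
        ranked = sorted(units, key=lambda i: u[i])
        # Nominate the maximum with probability zeta, otherwise the minimum.
        z = 1 if rng.random() < zeta else 0
        picked = ranked[-1] if z == 1 else ranked[0]
        sample.append((x[picked], k, z))
        # Replacement policy: Level 0 returns every unit to the population.
        if level == 1:                       # only the measured unit is removed
            available.remove(picked)
        elif level == 2:                     # the whole set is removed
            drop = set(units)
            available = [i for i in available if i not in drop]
    return sample


if __name__ == "__main__":
    values = [10, 20, 30, 40]
    print(draw_rns_sample(values, values, m=2, rho=[0, 1], zeta=0.5, level=1))
```

Passing the same list for `x` and `u` mimics perfect ranking; supplying a noisy `u` gives a simple way to experiment with imperfect ranking.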
One may also want to choose the number of maximums (and hence the number of minimums) in advance, instead of using a randomized process; this can be accomplished by conditioning on the $Z_i$'s (see Section 2.6). Despite the complexity of making inference conditional on $Z_i = z_i$ after randomization, the conditioning argument may lead to better results. However, the proportion of maximums required in this setting is another concern requiring attention. This concern can be addressed using the results we obtain in the randomized setting.

The outline of this chapter is as follows. In Section 2.2, we discuss three different ways of constructing an RNS design in finite populations using the replacement policies Level 0, Level 1 and Level 2. Section 2.3 deals with the construction of confidence intervals for population quantiles under the Level 0 RNS design. Several theoretical results are presented in this section. We also provide a guideline for choosing the design parameter $\zeta$ in the Level 0 RNS design to obtain more efficient confidence intervals for specific population quantiles compared with their SRS counterparts. In Sections 2.4 and 2.5, we develop recursive algorithms that can be used to obtain the confidence coefficients associated with Level 1 and Level 2 RNS confidence intervals, respectively. In Section 2.6, numerical studies are conducted to evaluate the performance of the RNS-based symmetric and equal-tail confidence intervals compared with their counterparts based on the SRS design. Section 2.6 also contains a discussion on the effect of the fixed set size, and of conditioning on $Z_i = z_i$, $i = 1, \ldots, m$, on the length of the constructed symmetric and equal-tail confidence intervals.

2.2 RNS Replacement Protocols

In this section, we describe three protocols for drawing randomized nomination samples from the finite population U. We assume that ranking of the units in each set is done based on an auxiliary variable. To set the notation, suppose we have a finite population of $N$ elements, labeled $U = \{1, \ldots, N\}$, consisting of bivariate pairs $(x_1, u_1), \ldots, (x_N, u_N)$, where X is the study variable and U is an auxiliary variable, known to us in advance and assumed to be positively (or negatively) associated with X.
Without loss of generality, we assume the $u_i$, $i = 1, \ldots, N$, are distinct; otherwise a random noise $\epsilon_i$ should be added to $u_i$ to make them unique. Let $u_{(i)}$, $i = 1, \ldots, N$, denote the $i$-th ordered $u$ value in the population and $x_{[i]}$ denote the $x$ value associated with $u_{(i)}$, so that the ordered population may be written as $(x_{[1]}, u_{(1)}), \ldots, (x_{[N]}, u_{(N)})$. Similarly, in a simple random sample of size $k$, we denote the $i$-th ordered $u$ value by $u_{i:k}$ and the associated $x$ value by $x_{[i:k]}$.

For each RNS protocol described below, the final sample can be represented as a set of vectors $\{(y_i, k_i, z_i), i = 1, \ldots, m\}$, where $y_i$ represents the final measurement obtained from set $i$, $k_i$ is the set size, and $z_i \in \{0, 1\}$. The value $z_i = 1$ indicates that $y_i$ is the measurement obtained from the unit associated with $x_{[k_i:k_i]}$, which is ranked the largest among the $k_i$ units in the set, while $z_i = 0$ indicates that $y_i$ is obtained from the unit associated with $x_{[1:k_i]}$, which is ranked the smallest in the set, so that

$$y_i = z_i \, x_{[k_i:k_i]} + (1 - z_i) \, x_{[1:k_i]}.$$

2.2.1 Level 0 RNS Design

In the Level 0 RNS design, given $K_i = k_i$ and $Z_i = z_i$, we first draw a simple random sample without replacement of size $k_i$ from the population and then rank the units in the set without actual measurement of the variable of interest. We then select, for full measurement, the unit ranked $k_i$ when $z_i = 1$ or the unit ranked 1 when $z_i = 0$ within the ordered set. We then return all the units to the population before drawing the next set for the next measurement. In other words, the Level 0 RNS protocol allows the same unit to appear more than once in the final sample as well as in the sets used for ranking purposes. The observations $y_i$ are therefore marginally independent under this protocol. Here, there is no restriction on the set size and we can simply let $M$ equal the population size $N$, although in practice small values of $M$ are recommended. We can state the Level 0 RNS design in algorithmic form as follows.
For each $i = 1, \ldots, m$:

Step I: Observe $K_i = k_i$ using the probability model $\rho = \{(\rho_1, \rho_2, \ldots, \rho_M) : 0 \le \rho_i \le 1, \sum_{i=1}^{M} \rho_i = 1\}$.
Step II: Draw a simple random set without replacement of size $k_i$ from the underlying population.
Step III: Using the auxiliary variable, rank the set units from the minimum to the maximum.
Step IV: Select the minimum (with probability $1 - \zeta$) or the maximum (with probability $\zeta$) from the ranked set and measure the study variable for this unit.
Step V: Return all $k_i$ units to the population.

2.2.2 Level 1 RNS Design

The algorithm for this level is similar to that of Level 0, except that in Step V we return only the $k_i - 1$ units that were not measured in Step IV. In other words, we do not allow the same unit to appear more than once in the final sample, and so the observations are not independent. As a result, we will see in the next section that finding the confidence interval for quantiles under this protocol is more complicated. In this level, the unit which is selected from the population is removed, and to be able to draw a randomized nomination sample of size $m$ from the underlying population, we are forced to restrict the set size $K_i$ to be less than $M - m + 1$, for $i = 1, \ldots, m$.

2.2.3 Level 2 RNS Design

In the Level 2 RNS design, the algorithm is similar to that of Level 0, except that none of the units selected in each set are returned to the population after we take a measurement from it (i.e., in Step V). That is, given $K_i = k_i$, all $k_i$ units in the set are removed from the population prior to the next selection. As in Level 1, units cannot be selected more than once for full measurement in the final sample. However, because of the strict insistence on sampling without replacement for both ranking and measurement purposes, the Level 2 sample can only be drawn under more restrictive conditions on the set size $K_i$. For this design, to be able to select a randomized nomination sample of size $m$ from the finite population U, we must have $P(K_i \le N/m) = 1$, for $i = 1, \ldots, m$.

2.2.4 A Small Example

To show the difference between these sampling protocols, we consider a simple example and obtain all possible randomized nomination samples, together with their probabilities under each protocol, from the underlying finite population.

Example 2.1. Suppose $U = \{1, 2, 3, 4\}$, and without loss of generality, assume that the values of the study variable are such that $x_1 < x_2 < x_3 < x_4$. Consider a simple case in which $m = 2$ and $P(K_i = 2) = 1$, for $i = 1, 2$. Table 2.1 shows all possible randomized nomination samples of size $m = 2$ and their probabilities (as a function of $\zeta$) under the various RNS sampling protocols from U. From Table 2.2, we can easily observe that the Level 0, Level 1 and Level 2 RNS designs assign substantially different probabilities to the possible samples depending on the value of $\zeta$. The effect of the without-replacement policy in the Level 2 RNS protocol can be seen from the fact that, when $\zeta = 0$, only two distinct samples are possible, $(x_1, x_2)$ and $(x_1, x_3)$. When $\zeta = 1$, the only samples that have positive probabilities of being chosen under the Level 2 RNS protocol are $(x_2, x_4)$ and $(x_3, x_4)$. The set of possible samples is larger under the Level 1 and Level 0 RNS designs.
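The probabilities in Example 2.1 can be checked by brute-force enumeration. The sketch below is an illustration rather than the thesis's own algorithm: the function name `rns_sample_distribution` and its arguments are hypothetical, the set size is assumed fixed at `k`, ranking is assumed perfect (unit `j` carries the `j`-th smallest value $x_j$), and samples are treated as unordered, so `(1, 2)` stands for the sample $(x_1, x_2)$.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations


def rns_sample_distribution(N, m, k, zeta, level):
    """Exact distribution of the unordered RNS sample for a fixed set size k.

    Units are labelled 1..N and assumed perfectly ranked. `level` is 0, 1
    or 2 and selects the replacement policy. Returns a dict mapping sorted
    tuples of unit labels to exact probabilities.
    """
    zeta = Fraction(zeta)
    dist = defaultdict(Fraction)

    def recurse(population, chosen, prob):
        if len(chosen) == m:
            dist[tuple(sorted(chosen))] += prob
            return
        sets = list(combinations(population, k))   # all equally likely sets
        for s in sets:
            # Minimum is nominated with probability 1 - zeta, maximum with zeta.
            for unit, p_pick in ((min(s), 1 - zeta), (max(s), zeta)):
                if p_pick == 0:
                    continue
                if level == 0:        # every unit goes back
                    remaining = population
                elif level == 1:      # only the measured unit is removed
                    remaining = tuple(v for v in population if v != unit)
                else:                 # Level 2: the whole set is removed
                    remaining = tuple(v for v in population if v not in s)
                recurse(remaining, chosen + [unit], prob * p_pick / len(sets))

    recurse(tuple(range(1, N + 1)), [], Fraction(1))
    return dict(dist)


if __name__ == "__main__":
    # Setting of Example 2.1: N = 4, m = 2, P(K = 2) = 1.
    for level in (0, 1, 2):
        print(level, rns_sample_distribution(4, 2, 2, Fraction(1, 2), level))
```

Using `Fraction` keeps the probabilities exact, which makes it easy to compare them with closed-form expressions in $\zeta$ at any rational value of the design parameter.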

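Table 2.1: Probabilities of possible samples under three RNS protocols when $N = 4$, $m = 2$, and $P(K = 2) = 1$.

Sample        Level 0                                                                             Level 1                                                                              Level 2
$x_1, x_1$    $\frac{9}{36}(1-\zeta)^2$                                                           $0$                                                                                  $0$
$x_1, x_2$    $\frac{6}{36}\zeta(1-\zeta) + \frac{12}{36}(1-\zeta)^2$                             $\frac{4}{36}\zeta(1-\zeta) + \frac{20}{36}(1-\zeta)^2$                              $\frac{24}{36}(1-\zeta)^2$
$x_1, x_3$    $\frac{12}{36}\zeta(1-\zeta) + \frac{6}{36}(1-\zeta)^2$                             $\frac{14}{36}\zeta(1-\zeta) + \frac{10}{36}(1-\zeta)^2$                             $\frac{12}{36}\zeta(1-\zeta) + \frac{12}{36}(1-\zeta)^2$
$x_1, x_4$    $\frac{18}{36}\zeta(1-\zeta)$                                                       $\frac{24}{36}\zeta(1-\zeta)$                                                        $\frac{24}{36}\zeta(1-\zeta)$
$x_2, x_2$    $\frac{1}{36}\zeta^2 + \frac{4}{36}\zeta(1-\zeta) + \frac{4}{36}(1-\zeta)^2$        $0$                                                                                  $0$
$x_2, x_3$    $\frac{4}{36}\zeta^2 + \frac{10}{36}\zeta(1-\zeta) + \frac{4}{36}(1-\zeta)^2$       $\frac{6}{36}\zeta^2 + \frac{12}{36}\zeta(1-\zeta) + \frac{6}{36}(1-\zeta)^2$        $\frac{24}{36}\zeta(1-\zeta)$
$x_2, x_4$    $\frac{6}{36}\zeta^2 + \frac{12}{36}\zeta(1-\zeta)$                                 $\frac{10}{36}\zeta^2 + \frac{14}{36}\zeta(1-\zeta)$                                 $\frac{12}{36}\zeta^2 + \frac{12}{36}\zeta(1-\zeta)$
$x_3, x_3$    $\frac{4}{36}\zeta^2 + \frac{4}{36}\zeta(1-\zeta) + \frac{1}{36}(1-\zeta)^2$        $0$                                                                                  $0$
$x_3, x_4$    $\frac{12}{36}\zeta^2 + \frac{6}{36}\zeta(1-\zeta)$                                 $\frac{20}{36}\zeta^2 + \frac{4}{36}\zeta(1-\zeta)$                                  $\frac{24}{36}\zeta^2$
$x_4, x_4$    $\frac{9}{36}\zeta^2$                                                               $0$                                                                                  $0$

Table 2.2: Probabilities of all possible samples under three RNS protocols in Table 2.1 when $\zeta \in \{0, 0.5, 1\}$.

              $\zeta = 0$                      $\zeta = 0.5$                    $\zeta = 1$
Sample        Level 0   Level 1   Level 2      Level 0   Level 1   Level 2      Level 0   Level 1   Level 2
$x_1, x_1$    1/4       0         0            1/16      0         0            0         0         0
$x_1, x_2$    1/3       5/9       2/3          1/8       1/6       1/6          0         0         0
$x_1, x_3$    1/6       5/18      1/3          1/8       1/6       1/6          0         0         0
$x_1, x_4$    0         0         0            1/8       1/6       1/6          0         0         0
$x_2, x_2$    1/9       0         0            1/16      0         0            1/36      0         0
$x_2, x_3$    1/9       1/6       0            1/8       1/6       1/6          1/9       1/6       0
$x_2, x_4$    0         0         0            1/8       1/6       1/6          1/6       5/18      1/3
$x_3, x_3$    1/36      0         0            1/16      0         0            1/9       0         0
$x_3, x_4$    0         0         0            1/8       1/6       1/6          1/3       5/9       2/3
$x_4, x_4$    0         0         0            1/16      0         0            1/4       0         0

2.3 Confidence Intervals Based on Level 0 RNS Design

Suppose $Y_1, Y_2, \ldots, Y_m$ is a sample of size $m$ drawn from the finite population U (consisting of $N$ elements) using the Level 0 RNS design with design parameters $\rho$ and $\zeta$. Let the ordered values of the study variable for the population elements be denoted by $x_{(1)} < x_{(2)} < \cdots < x_{(N)}$. Let $t$ be a fixed integer, $t \in \{1, \ldots, N\}$; it is desired to obtain a confidence interval for $x_{(t)}$, the $(t/N)$-th quantile of U, where $P(X < x_{(t)}) \le t/N \le P(X \le x_{(t)})$.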
