
QUALITY AND RELIABILITY ENGINEERING INTERNATIONAL
Qual. Reliab. Engng. Int. 2008; 24:99-106
Published online 19 June 2007 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/qre.870

Research

Some Relationships between Gage R&R Criteria

William H. Woodall (1) and Connie M. Borror (2)

(1) Department of Statistics, Virginia Tech, Blacksburg, VA 24061-0439, U.S.A.
(2) Arizona State University West, Glendale, AZ 85306, U.S.A.

In this paper, several commonly used gage repeatability and reproducibility (R&R) acceptance criteria are discussed. The number of distinct categories (ndc) is equivalent to an acceptance criterion based on another standard metric, the ratio of the estimated measurement system standard deviation to the estimated overall standard deviation of the measurements. This implies that the criterion based on ndc could be considered redundant. Several acceptance criteria are revisited and compared, including a discussion of more objective measures of capability based on misclassification rates. The relationship between ndc and the discrimination ratio is also given. Copyright 2007 John Wiley & Sons, Ltd.

Received 1 September 2006; Accepted 30 October 2006

KEY WORDS: gage R&R study; gauge studies; measurement system analysis; misclassification rates; variance components

1. INTRODUCTION

Gage repeatability and reproducibility (R&R) studies are widely used to assess measurement system variation relative to process variation and tolerance limits. It is typically assumed that

σ²_T = σ²_P + σ²

where σ²_T is the variance of the measurements, σ²_P is the variance of the process, and σ² is the variance of the measurement process. Measurement capability metrics can be categorized as (1) those that compare measurement variation to total or part variation and (2) those that compare measurement variation to tolerance width.
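The decomposition and the two families of metrics can be illustrated with a short sketch. The variance values and the precision-to-tolerance form used for family (2) are illustrative assumptions of ours, not values or definitions taken from this paper.

```python
import math

# Illustrative (assumed) variance components
var_process = 4.0       # sigma^2_P: part-to-part variance
var_measurement = 0.25  # sigma^2: measurement system variance
var_total = var_process + var_measurement  # sigma^2_T = sigma^2_P + sigma^2

# Family (1): measurement variation relative to total variation
pct_study_var = math.sqrt(var_measurement / var_total)  # sigma / sigma_T

# Family (2): measurement variation relative to tolerance width
# (one common form, assumed here: 6*sigma / (USL - LSL))
usl, lsl = 13.0, 9.0
p_to_t = 6.0 * math.sqrt(var_measurement) / (usl - lsl)

print(f"sigma/sigma_T = {pct_study_var:.4f}, P/T = {p_to_t:.4f}")
```

Note that family (1) is dimensionless and tolerance-free, while family (2) changes whenever the specification limits do; the paper concentrates on the first family.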
The measurement system variance σ² is usually decomposed into components corresponding to repeatability and reproducibility, but this decomposition is not relevant to the discussion here. We concentrate primarily on measurement capability metrics that compare measurement variation to total or part variation.

1.1. Notation

Using notation and descriptions similar to those of Burdick et al. [1,2], the variance components and the important ratios of variance components are reported in Table I. It should also be noted that the word "gage" is used interchangeably with the spelling "gauge" in the literature. Both spellings are acceptable, and the reader will see both in this paper.

Correspondence to: William H. Woodall, Department of Statistics, Virginia Tech, 406-A Hutcheson Hall, Blacksburg, VA 24061-0439, U.S.A. E-mail: bwoodall@vt.edu
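The ratios in Table I are not independent of one another, which matters for the equivalences discussed later. A minimal numerical check, with variance components that are our own illustrative assumptions:

```python
# Illustrative (assumed) variance components
var_p = 4.0   # sigma^2_P, process (part-to-part) variance
var_m = 0.25  # sigma^2, measurement system variance
var_t = var_p + var_m  # sigma^2_T, total variance

rho_p = var_p / var_t  # proportion of total variance due to the process
rho_m = var_m / var_t  # proportion due to the measurement system
gamma = var_p / var_m  # process-to-measurement variance ratio

# The ratios are linked: rho_P + rho_M = 1 and
# gamma = rho_P / rho_M = (1 - rho_M) / rho_M.
assert abs(rho_p + rho_m - 1.0) < 1e-12
assert abs(gamma - rho_p / rho_m) < 1e-12
```

Because any one of ρ_P, ρ_M, and γ determines the other two, an acceptance criterion stated in terms of one can always be restated in terms of the others.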

100 W. H. WOODALL AND C. M. BORROR

Table I. Parameters and descriptions

Parameter               Description
σ²_P                    Variance of the process (often referred to as part-to-part variability)
σ²                      Variance of the measurement system (repeatability and reproducibility)
σ²_T = σ²_P + σ²        Total variance of the response
ρ_P = σ²_P / σ²_T       Proportion of total variance due to the process
ρ_M = σ² / σ²_T         Proportion of total variance due to the measurement system
γ = σ²_P / σ²           Ratio of process variance to measurement system variance

2. MEASUREMENT CAPABILITY METRICS

There are numerous criteria used to assess the capability of a measurement system. In this section the emphasis is on capability metrics that compare measurement variation to total or part variation. A metric that involves the use of specification limits is, however, discussed and illustrated for comparison purposes.

2.1. %R&R criterion

The estimated variance components are obtained from the gage R&R study. One basic criterion for the acceptability of the measurement system is that σ̂/σ̂_T (= √ρ̂_M), commonly referred to as the Total gage R&R %Study Var, should be suitably small. (It should be noted that the title %Study Var, as provided in statistical packages, refers to the per cent study variation, which involves ratios of standard deviations, not ratios of variances as the name may lead the reader to believe.) According to AIAG (p. 77) [3], a value of σ̂/σ̂_T < 0.1 is considered acceptable. Values between 0.1 and 0.3 may be acceptable depending on factors such as the importance of the application, the cost of the measurement device, and the cost of repair. Values over 30% are generally considered unacceptable, and it is recommended that every effort be made to improve the measurement system.
2.2. Number of categories criteria

The number of distinct categories that a measurement system can readily identify is often used as a capability metric, and several metrics report such a number. One metric, recommended by AIAG (p. 117) [3] for use in the final stage of the analysis of gage R&R data, is the number of distinct categories (ndc). This is defined as the number of non-overlapping 97% confidence intervals for the true value of the measured characteristic that will span the expected product variation. It is also described as the number of distinct categories that can be reliably distinguished by the measurement system. The formula given by AIAG (p. 117) [3] and used, for example, by MINITAB can be written as

ndc = 1.41 √γ̂    (1)

The value of ndc is truncated to give an integer. The rule of thumb is that ndc should be at least 5 for the measurement system to be acceptable. According to AIAG (p. 117) [3], Wheeler and Lyday [4] referred to ndc as the classification ratio.
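Equation (1), together with the truncation convention, can be sketched as follows; the helper name and the example values are ours, not the paper's.

```python
import math

def ndc(var_process, var_measurement):
    """Number of distinct categories: ndc = 1.41 * sqrt(gamma-hat),
    truncated to an integer, where gamma-hat = var_process / var_measurement."""
    gamma_hat = var_process / var_measurement
    return int(1.41 * math.sqrt(gamma_hat))  # int() truncates, per the convention

# Rule of thumb: the measurement system is considered acceptable when ndc >= 5.
# E.g. gamma-hat = 25 gives 1.41 * 5 = 7.05, truncated to ndc = 7.
```

The truncation step means a computed value of, say, 4.9 reports as 4 and fails the rule of thumb even though it is nearly 5.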

SOME RELATIONSHIPS BETWEEN GAGE R&R CRITERIA 101

The formula in Equation (1) can be obtained using the fact that a confidence interval for the true value based on a single measurement has width 2(2.17)σ̂, where 2.17 is the 98.5th percentile of the standard normal distribution, together with the assumption that the process variation is captured by an interval of width 6.12σ̂_P. The width of the confidence interval, however, depends on the number of degrees of freedom (df) associated with the estimation of the measurement error variance. The df resulting from a gage R&R study with p parts, o operators, and r replications is po(r − 1). The form and example from AIAG [3] have p = 10, o = 3, and r = 3, giving df = 60. With the confidence interval based on a t-distribution with 60 df, and the process variation assumed to be captured by ±3.09σ̂_P, the constant in Equation (1) would be 1.39 instead of 1.41.

The ndc has also been reported as the signal-to-noise ratio (SNR) (see Burdick et al. [1,2], AIAG [5], and Larsen [6]), and can be written as

SNR = √(2γ̂)    (2)

It should be noted that in AIAG [3] the SNR no longer involves the 2 and is given as

SNR = √γ̂

Still a third metric that can be used to determine the number of categories the measurement system is capable of distinguishing is the discrimination ratio (D_R). As stated in AIAG (p. 113) [3], "The number of data categories is often referred to as the discrimination ratio since it describes how many classifications can be reliably distinguished given the observed process variation." The value of D_R, as defined by Wheeler [7] and Mader et al. [8], can be written as

D_R = √((1 + ρ̂_P)/(1 − ρ̂_P))

Burdick et al. [1] reported the D_R formula as shown above, but incorrectly omitted the square root sign.

2.3. Misclassification rates

The metrics described above have often been criticized for being too subjective.
A less subjective measure of an adequate measurement system discussed in the literature is the misclassification rate, since it is based on the actual performance of the measurement process. The adequacy of a measurement system can be determined by its ability to distinguish between good and bad parts. Misclassifying a good part as bad (a false failure) or a bad part as good (a missed fault) can be costly. The probability that a system will misclassify parts is therefore an important metric that should be estimated in addition to the metrics corresponding to the other criteria. Unlike the measures ndc, SNR, and D_R, misclassification rates provide a more objective measure of system capability. Unlike the number of categories metrics, however, the misclassification rates depend on the specification limits.

Consider the model

Y = X + E

where Y is the measured value of a randomly selected part, X is the true value of the part, and E is the measurement error. It is assumed that X and E are independent, with X ~ N(μ_P, σ²_P) and E ~ N(μ_M, σ²_E). As a result, Y is normally distributed with mean μ_Y = μ_P + μ_M and variance σ²_Y = σ²_P + σ²_E. The mean μ_M is the measurement bias, which is often assumed to be 0. We say that a part is in conformance if LSL < X < USL, and that the measurement system passes a part if LSL < Y < USL. There are two types of possible misclassifications of a part. The first is a false failure, which occurs when the part really is in conformance but is not passed. The second is a missed fault, which occurs when the part is not in conformance but the measurement system passes the part. The probability of a false failure is sometimes referred to as the producer's risk, while the probability of a missed fault is called the consumer's risk. Both of these types of misclassification can be quite costly.

The probabilities of these misclassifications can be calculated using joint probabilities or conditional probabilities (see Burdick et al. [9]). We present the joint probabilities for illustration. The probability of a false failure is

δ = P(LSL < X < USL and (Y < LSL or Y > USL))

and the probability of a missed fault is

β = P((X < LSL or X > USL) and LSL < Y < USL)

If δ, β, or both are unacceptably large, then the measurement system is not acceptable. Unfortunately, there are no standard acceptable levels to which these values can be compared. Burdick et al. [9] therefore suggested comparing the misclassification rates given above to the misclassification rates one would obtain from some chance process. A chance process is obtained by taking no measurements and simply classifying a randomly chosen proportion π of the parts as good and the remaining proportion 1 − π as bad, where π is the assumed known proportion of conforming parts. The misclassification rates for the chance process are then

δ_chance = π(1 − π)

and

β_chance = (1 − π)π

Based on the definition of a chance process, if the measurement system in place is no better than randomly assigning parts as good or bad, the current measurement system is extremely poor and is not useful in assessing process capability. Burdick et al. [1,2,9] presented confidence intervals on the ratios of the misclassification rates for the current process to the misclassification rates for the chance process.
Specifically, confidence intervals are constructed on the ratios

δ/δ_chance and β/β_chance

The interpretation of the constructed confidence intervals is straightforward. If a confidence interval contains the value of unity, this indicates that the measurement system in place may be no better than a measurement system in which parts are classified by some a priori chance probability. For example, if the confidence interval on δ/δ_chance contains the value 1, then δ/δ_chance could be equal to 1, indicating that δ = δ_chance. The desirable result is for the entire confidence interval to cover a range of values less than 1. The reader is encouraged to see Burdick et al. [1,2,9] for complete descriptions and development of these misclassification rates and confidence intervals.
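The joint misclassification probabilities and the chance-process benchmark are straightforward to approximate by simulation. The sketch below uses parameter values we have assumed for illustration; they are not the paper's wheat example.

```python
import random

random.seed(1)
n = 200_000
mu_p, sigma_p = 11.0, 1.0   # true part values X ~ N(mu_P, sigma_P^2)
sigma_e = 0.35              # measurement error E ~ N(0, sigma_E^2), zero bias
lsl, usl = 9.0, 13.0

n_conf = 0          # parts in conformance (LSL < X < USL)
n_false_fail = 0    # conforming but not passed
n_missed = 0        # nonconforming but passed
for _ in range(n):
    x = random.gauss(mu_p, sigma_p)      # true value
    y = x + random.gauss(0.0, sigma_e)   # measured value, Y = X + E
    conforming = lsl < x < usl
    passed = lsl < y < usl
    n_conf += conforming
    n_false_fail += conforming and not passed
    n_missed += (not conforming) and passed

delta = n_false_fail / n    # false failure rate (producer's risk)
beta = n_missed / n         # missed fault rate (consumer's risk)

# Chance-process benchmark: classify a proportion pi as good at random,
# where pi is the proportion of conforming parts.
pi = n_conf / n
delta_chance = pi * (1 - pi)
beta_chance = (1 - pi) * pi
```

With a measurement error standard deviation well below the process standard deviation, both simulated rates fall below their chance-process counterparts, which is the qualitative behavior the ratio criteria test for.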

3. CRITERIA RELATIONSHIPS

Although the measurement capability metrics have been reported, and the relationships between several of the criteria presented in the literature, the redundancy of the ndc criterion deserves further attention. The equivalence of the per cent of total variation and ndc criteria was shown by Burdick et al. [1] and Larsen [6]. Several authors have also reported on the relationship between SNR and the per cent of total variation, as well as between D_R and the per cent of total variation. In this section we further explore these relationships and make some recommendations.

Relatively simple algebra shows that the requirement that ndc be at least 5 is equivalent to the requirement that σ̂/σ̂_T (= √ρ̂_M) be less than 27.14% or, equivalently, that ρ̂_M be less than 7.4% if one prefers to consider the ratio of variances. This implies that the use of the ndc criterion in assessing the adequacy of the measurement system could be considered redundant. In fact, one can calculate the value of ndc, if desired, from the Total gage R&R %Study Var using the following relationship:

ndc = 1.41[(%Study Var)^(−2) − 1]^(1/2) = 1.41[(ρ̂_M)^(−1) − 1]^(1/2)

If σ̂/σ̂_T < 0.10, then the value of ndc is at least 14. Such a value seems much more in line with the common 10-to-1 (or "ten-bucket") rule of AIAG (p. 43) [3] than a minimum ndc value of five. The 10-to-1 rule is interpreted to mean that the measuring equipment should be able to discriminate to at least one-tenth of the process variation. Of course, one can still use ndc to help characterize the performance of the measurement system, if desired, but the interpretation of σ̂/σ̂_T (or √ρ̂_M) seems much simpler and more intuitive.

There is a close relationship between ndc and the discrimination ratio (D_R) discussed by Wheeler [7] and Wheeler and Lyday (Chapter 5) [10]. If one uses 1.41 as an approximation of 2^(1/2), then

D_R = [ndc² + 1]^(1/2)

Wheeler and Lyday (p. 59) [10] stated that "it might be well to work on the measurement process" when D_R falls below 4.0 or so, a criterion less stringent than requiring ndc to be 5 or greater. The discrimination ratio was also shown to be related to both the SNR and per cent of total variation criteria by Burdick et al. [1] and Larsen [6].

3.1. Confidence intervals

The ndc, SNR, and D_R metrics discussed in the previous sections have often been criticized in the literature for being too subjective and for not adequately describing the capability of the measurement system. One criticism is that a single-number estimate does not reflect the variability of the estimator. Confidence intervals on the various metrics are therefore preferred to single-number estimates. Confidence intervals for some of these criterion metrics are well developed and discussed in the literature (see, e.g. Burdick et al. [1,2], Burdick and Larsen [11], Conors et al. [12], Montgomery and Runger [13], and Vardeman and Van Valkenburg [14]). Furthermore, confidence intervals on misclassification rates have been developed by Burdick et al. [1,2,9].

4. EXAMPLE

A discussion of the redundancy of the number of categories criteria and the effectiveness of more objective criteria, such as the misclassification rates, would be incomplete without an illustration. In this example we provide estimates of each of the criteria presented here, as well as 95% confidence intervals for each. Consider a wheat-based biscuit cooking process similar to that described by Srikaeo et al. [15]. In this type of process there are several process variables of interest, including wheat moisture content (WtMoist), measured in per cent. The specifications on this particular variable are LSL = 9% and USL = 13%. Suppose 10 wheat samples were selected and measured for moisture content by three operators, with each operator measuring each sample three times. An analysis of variance was conducted on the 90 resulting moisture measurements, with the results displayed in Table II. Based on the analysis of variance, wheat samples, operators, and the sample-by-operator interaction are all statistically significant. Estimates and 95% confidence intervals for each of the criterion quantities are reported in Table III. The confidence intervals reported in Table III were calculated using approximate closed-form methods for SNR, ndc, and D_R. Generalized inference methods were used to construct the confidence intervals for the misclassification rates and √ρ̂_M (see Burdick et al. [2] for complete descriptions of these confidence interval calculations).

Based on the point estimates for the number of categories criteria (SNR, ndc, and D_R), the measurement process appears to be capable: all four such estimates satisfy the recommendation that these values be greater than 4 (or 5). The point estimate for √ρ̂_M is slightly higher than the recommended cutoff of 10%, but much less than the equivalent cutoff of 27.14% shown earlier (for ndc to be at least 5). Further investigation of the 95% confidence bounds on these criteria, however, indicates a very wide range of possible values (as low as 1.2 and as high as 14) for the number of categories criteria. Based on these intervals, the measurement process may or may not be capable. The 95% confidence interval on √ρ̂_M shows that this value could be anywhere between 5.87% and 16.66%. Again, the capability of the measurement system is unknown based on this criterion. There are no single point estimates available for the misclassification rates, only lower and upper bounds at a given level of confidence. The 95% confidence interval on the false failure ratio (δ/δ_chance) is (0.1048, 4.486). Since this interval contains unity, the current measurement process in place may be no better than a measurement process that can be obtained by chance.
In practice, this indicates that the measurement system may have a high probability of misclassifying a good part as bad. The 95% confidence interval on the missed fault ratio (β/β_chance) is (0.0878, 0.3641). Since this entire interval covers a range less than unity, the measurement system in place is better than a chance process in terms of the missed fault misclassification. That is, the current measurement system has a relatively low probability of misclassifying a bad part as good.

Table II. Analysis of variance for wheat moisture content

Source              DF    SS       MS      F       P
Sample               9    38.8032  4.3115  680.76  0.000
Operator             2     0.3982  0.1991   31.44  0.000
Sample x operator   18     0.4884  0.0271    4.28  0.000
Error               60     0.3800  0.0063
Total               89    40.0699

Table III. Results for the wheat-based biscuit cooking process

Estimator      Estimate   95% confidence interval
√ρ̂_M           0.113      (0.0587, 0.1666)
SNR (a)        5.006      (1.2788, 9.9088)
SNR (b)        7.079      (1.8085, 14.0132)
ndc            7.058      (1.8031, 13.9714)
D_R            7.128      (2.0618, 14.0071)
δ/δ_chance     n/a        (0.1048, 4.4860)
β/β_chance     n/a        (0.0878, 0.3641)

(a) SNR = √γ̂. (b) SNR = √(2γ̂).
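Under the standard two-way random-effects model, the variance components implied by the Table II mean squares can be recovered by the usual method-of-moments (ANOVA) estimators, and they closely reproduce the SNR, ndc, and D_R point estimates of Table III (small differences remain because the paper's interval methods are not identical). The estimator formulas below are the standard ANOVA ones, assumed by us rather than quoted from the paper.

```python
import math

# Design and mean squares from Table II
p, o, r = 10, 3, 3                      # parts, operators, replicates
ms_part, ms_oper = 4.3115, 0.1991
ms_po, ms_err = 0.0271, 0.0063

# Method-of-moments (ANOVA) variance-component estimates
var_repeat = ms_err                     # repeatability
var_po = (ms_po - ms_err) / r           # part-by-operator interaction
var_oper = (ms_oper - ms_po) / (p * r)  # operator (reproducibility)
var_part = (ms_part - ms_po) / (o * r)  # process (part-to-part)

var_meas = var_repeat + var_po + var_oper  # sigma^2 (total R&R)
var_total = var_part + var_meas
gamma_hat = var_part / var_meas
rho_m_hat = var_meas / var_total

snr_a = math.sqrt(gamma_hat)            # SNR = sqrt(gamma-hat)
snr_b = math.sqrt(2.0 * gamma_hat)      # SNR = sqrt(2 * gamma-hat)
ndc = 1.41 * math.sqrt(gamma_hat)       # Equation (1), before truncation
d_r = math.sqrt(2.0 * gamma_hat + 1.0)  # D_R = sqrt((1+rho_P)/(1-rho_P))

# Section 3 identity: ndc rewritten in terms of rho_M-hat
ndc_from_rho = 1.41 * math.sqrt(1.0 / rho_m_hat - 1.0)
```

The final line checks the paper's Section 3 claim numerically: since γ̂ = (1 − ρ̂_M)/ρ̂_M, the two expressions for ndc agree exactly.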

It is important to note that the number of operators used by Srikaeo et al. [15], and recreated here, is not recommended for an adequate gage R&R study. Researchers have shown that the number of operators should be larger than the customary three when the operator effect is considered random (see, e.g. Burdick et al. [1,2,9]). As a result of the small number of operators, the confidence intervals in this example are quite wide. Increasing the number of operators in any gage R&R study in which the operator effect is indeed random is recommended.

5. CONCLUSIONS

In this paper the equivalence of several standard measurement system criteria was presented and highlighted. In addition, the use of these criteria was discussed and illustrated in an example involving a wheat-based biscuit cooking process.

Acknowledgements

The authors appreciate the input of Donald J. Wheeler on a previous version of this paper.

REFERENCES

1. Burdick RK, Borror CM, Montgomery DC. A review of methods for measurement systems capability analysis. Journal of Quality Technology 2003; 35:342-354.
2. Burdick RK, Borror CM, Montgomery DC. Design and Analysis of Gauge R&R Studies: Making Decisions with Confidence Intervals in Random and Mixed ANOVA Models (ASA-SIAM Series on Statistics and Applied Probability). SIAM: Philadelphia, PA; ASA: Alexandria, VA, 2005.
3. Automotive Industry Action Group (AIAG). Measurement Systems Analysis (3rd edn). AIAG: Detroit, MI, 2002.
4. Wheeler DJ, Lyday RW. Evaluating the Measurement Process. SPC Press: Knoxville, TN, 1984.
5. Automotive Industry Action Group (AIAG). Measurement Systems Analysis (2nd edn). AIAG: Detroit, MI, 1995.
6. Larsen GA. Measurement system analysis: the usual metrics can be noninformative. Quality Engineering 2002; 15:293-298.
7. Wheeler DJ. Problems with gauge R&R studies. ASQC Quality Congress Transactions, Nashville, TN, 1992; 179-185.
8. Mader DP, Prins J, Lampe RE. The economic impact of measurement error. Quality Engineering 1999; 11:563-574.
9. Burdick RK, Park Y-J, Montgomery DC, Borror CM. Confidence intervals for misclassification rates in a gauge R&R study. Journal of Quality Technology 2005; 37:294-303.
10. Wheeler DJ, Lyday RW. Evaluating the Measurement Process (2nd edn). SPC Press: Knoxville, TN, 1989.
11. Burdick RK, Larsen GA. Confidence intervals on measures of variability in R&R studies. Journal of Quality Technology 1997; 29:261-273.
12. Conors M, Merrill K, O'Donnell B. A comprehensive approach to measurement system evaluation. ASA Proceedings of the Section on Physical and Engineering Sciences, Alexandria, VA, 1995; 136-138.
13. Montgomery DC, Runger GC. Gauge capability and designed experiments, part I: basic methods. Quality Engineering 1993; 6:115-135.
14. Vardeman SB, Van Valkenburg ES. Two-way random-effects analyses and gauge R&R studies. Technometrics 1999; 41:202-211.
15. Srikaeo K, Furst J, Ashton J. Characterization of wheat-based biscuit cooking process by statistical process control techniques. Food Control 2005; 16:309-317.

Authors' biographies

William H. Woodall is a Professor of Statistics at Virginia Tech. He holds a BS in Mathematics from Millsaps College (1972) and an MS (1974) and PhD (1980) in Statistics from Virginia Tech. His research interests are statistical quality control and improvement, all aspects of control charting, public health surveillance, and critiques of fuzzy logic. Dr Woodall is the author of over 70 refereed journal articles. He is a former editor of the Journal of Quality Technology (2001-2003) and associate editor of Technometrics (1987-1995), and he serves on the editorial review board of the Journal of Quality Technology (1988-present). He is the recipient of the Shewhart Medal (2002), the Jack Youden Prize (1995, 2003), the Brumbaugh Award (2000, 2006), the Ellis Ott Foundation Award (1987), and a best paper award for IIE Transactions on Quality and Reliability Engineering (1997). He is a Fellow of the American Statistical Association, a Fellow of the American Society for Quality, and an elected member of the International Statistical Institute.

Connie M. Borror is an Associate Professor in the Mathematical Sciences and Applied Computing Department at Arizona State University West. She received her BS in Mathematics in 1988 and her Master's in Mathematics in 1992, both from Southern Illinois University at Edwardsville. Dr Borror earned her PhD in Industrial Engineering from Arizona State University in 1998. Her research interests include experimental design, response surface methods, statistical process control, and applied industrial statistics. Her e-mail address is cborror@asu.edu.