Concentration-based Delta Check for Laboratory Error Detection

Northeastern University
Department of Electrical and Computer Engineering

Concentration-based Delta Check for Laboratory Error Detection

Biomedical Signal Processing, Imaging, Reasoning, and Learning (BSPIRAL) Group

Author: Jamshid Sourati
Reviewers: Deniz Erdogmus, Murat Akcakaya, Steve C. Kazmierczak, Todd K. Leen
Supervisor: Deniz Erdogmus

June 2015

* Cite this report in the following format: Jamshid Sourati, "Concentration-based Delta Check for Laboratory Error Detection," Technical Report, BSPIRAL-563-R, Northeastern University, 2015.

Abstract

Investigating the variation of clinical measurements of patients over time is a common technique, known as a delta check, for detecting laboratory errors. Delta checks are based on the expected biological variations and machine imprecision, where the latter varies for different concentrations of the analytes. Here, we present a novel delta check method in the form of composite thresholding, and provide its sufficient statistics by constructing the corresponding discriminant function, which enables us to use statistical and learning analysis tools. Using the scores obtained from such a discriminant function, we statistically study the performance of our algorithm on a labeled data set for the purpose of detecting lab errors.

Contents

1 Introduction
2 Notations
3 Decision Rules
  3.1 Causal Delta Check
  3.2 Non-Causal Delta Check
4 Data Analysis
  4.1 Sufficient Statistics
  4.2 Estimating the ROC Curves
  4.3 Estimating the AUC
  4.4 Comparing the ROC Curves
5 Experimental Results
  5.1 Experiment Settings
  5.2 Statistics
6 Conclusion

1 Introduction

Quality control is an important stage in analyzing specimens in clinical laboratories. The goal is to detect the erroneous measurements in a pool of clinical samples. Traditional quality control systems represent a significant expenditure of financial and personnel resources, and are sensitive to only a small percentage of the laboratory errors (Witte et al., 1997; Hawkins, 2012). Computer-aided algorithms have recently been developed for automatically detecting errors among the samples.

Each clinical measurement is a vector of numbers, where each component represents the evaluation of the concentration of a specific substance, or analyte, in the patient's blood. Most automatic algorithms are based on the differences between the analytes of the current measurement of a given patient and those in a prior measurement of the same patient. Normally, these changes do not exceed certain upper limits, unless there is an error in the reported analyte values. Such variation-based detection techniques are usually called delta checks in the clinical literature (Strathmann et al., 2011).

Delta checks can be more complicated than simple thresholding. For instance, the magnitude of the analyte variations usually changes across different concentrations (Ricos et al., 2009). Therefore, a composite thresholding is needed, where both the cut-off parameters and the values to be thresholded differ across the concentration ranges. Such approaches are not straightforward to analyze, as they do not match the framework of standard statistical and learning analysis. For instance, most of the theoretical results regarding the computation and comparison of ROC curves assume that a one-dimensional continuous test is being thresholded for labeling the samples (Pepe, 2003).

In this paper, we present a novel composite delta check, designed for automatically detecting erroneous clinical measurements, and also present its sufficient statistics. These statistics are obtained by building the discriminant functions of our decision rules and can also be viewed as a dimensionality reduction of the feature vectors. Calculating the scalar scores using such discriminant functions enables us to apply various statistical and learning tools to analyze the delta check, say, by building learning models or using statistical analysis of the data. Here, as simple examples, we use them to evaluate the performance of our detection algorithm by means of statistical analysis based on the ROC curves. Under this analysis, we discuss the strength of each analyte in distinguishing the erroneous samples.

2 Notations

Throughout this paper, we use $\Omega$ to denote the set of analytes that are evaluated for each patient. The vector $x = [x_1, \ldots, x_d] \in \mathbb{R}^d$ (where $d = |\Omega|$) contains the evaluations of the analytes in $\Omega$ for a given patient at a specific time. The differences between the analyte values in $x$ and the prior and subsequent measurements of the same patient, within a 24-hour interval, are denoted by the variation vectors $\Delta x = [\Delta x_1, \ldots, \Delta x_d]$ and $\nabla x = [\nabla x_1, \ldots, \nabla x_d]$, respectively. Also assume that $n$ denotes the number of samples available. Finally, $\Phi(\cdot)$ is the CDF of the standard normal distribution $\mathcal{N}(0, 1)$, and $\mathbb{1}(A)$ is an indicator function, whose value is 1 if the expression $A$ is true, and 0 otherwise.

Lab error detection problem: given a set of evaluated analytes stored in $x$, together with either one or both types of variations $\Delta x$ and $\nabla x$, use all or a subset of analytes to determine if there exists an erroneous measurement in $x$.

3 Decision Rules

In this section, we discuss forming the feature vectors and formulate our decision rules in two general cases: (1) the causal check using only $\Delta x$, and (2) the non-causal check using both $\Delta x$ and $\nabla x$. A decision rule is a mapping from the feature space to the binary space $\{0, 1\}$, where 1 represents the decision of labeling a sample as an error, and 0 otherwise.

3.1 Causal Delta Check

In order to take into account different concentrations, we divide $[0, \infty)$, as the range of all possible values of the analyte evaluations, into three intervals. For instance, for the analyte indexed by $a$ ($1 \le a \le d$), we get $[0, l_a]$, $(l_a, u_a]$ and $(u_a, \infty)$ for $x_a$, where each interval is assigned a different threshold: $\beta_{1,a}$ (an absolute value), $\beta_{2,a}$ and $\beta_{3,a}$ (in the form of a percentage with respect to $x_a$), respectively.

First, suppose the decision is to be made based on the individual analyte $a$. The feature vector is constructed as $y_a = [x_a \;\; \Delta x_a]$. Then, our decision rule in this case, denoted by $h_a$, is

$$h_a(y_a) = \begin{cases} \mathbb{1}\left(|\Delta x_a| \ge \beta_{1,a}\right), & 0 \le x_a \le l_a \\ \mathbb{1}\left(\frac{|\Delta x_a|}{x_a} \ge \beta_{2,a}\right), & l_a < x_a \le u_a \\ \mathbb{1}\left(\frac{|\Delta x_a|}{x_a} \ge \beta_{3,a}\right), & u_a < x_a \end{cases} \tag{1}$$

Now, if we consider all the analytes in $\Omega$, we construct the feature vector by including all the values and variations:

$$y = [\underbrace{x_1 \;\; \Delta x_1}_{y_1} \;\; \underbrace{x_2 \;\; \Delta x_2}_{y_2} \;\; \ldots \;\; \underbrace{x_d \;\; \Delta x_d}_{y_d}].$$

The sample will be labeled as an error if the variation of at least one of the analytes exceeds the threshold. Hence, we can formulate the decision rule as

$$h(y) = \max_{1 \le i \le d} \{h_i(y_i)\}. \tag{2}$$
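To make the composite thresholding concrete, here is a minimal Python sketch of the causal rules (1) and (2). The parameter tables and their example values (`CUT_POINTS`, `BETAS`, the sodium numbers) are hypothetical placeholders, not the clinically derived values used in the report.

```python
# Hypothetical per-analyte parameters: concentration cut points (l_a, u_a) and
# thresholds (beta_1 absolute; beta_2, beta_3 relative to x_a). The values are
# illustrative placeholders only.
CUT_POINTS = {"Na": (120.0, 150.0)}
BETAS = {"Na": (6.0, 0.05, 0.04)}

def h_single(x_a, dx_a, analyte):
    """Causal single-analyte rule of Eq. (1): returns 1 to flag an error."""
    l_a, u_a = CUT_POINTS[analyte]
    b1, b2, b3 = BETAS[analyte]
    if x_a <= l_a:                       # low concentrations: absolute threshold
        return int(abs(dx_a) >= b1)
    if x_a <= u_a:                       # middle range: percentage threshold
        return int(abs(dx_a) / x_a >= b2)
    return int(abs(dx_a) / x_a >= b3)    # high range: percentage threshold

def h_multi(x, dx, analytes):
    """Multiple-analyte rule of Eq. (2): flag if any single analyte fires."""
    return max(h_single(xi, di, a) for xi, di, a in zip(x, dx, analytes))
```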

3.2 Non-Causal Delta Check

Here, we use the same partitioning of the concentration ranges of the analytes, and the same thresholds in each division, as in the previous case. Again, let us first focus on a single analyte indexed by $a$. The feature vector will be constructed as $y_a = [x_a \;\; \Delta x_a \;\; \nabla x_a]$. Then, we label the sample as an error if the variations in both directions, i.e. $\Delta x_a$ and $\nabla x_a$, violate the corresponding threshold. Such a decision rule can be written as

$$h_a(y_a) = \begin{cases} \mathbb{1}\left(|\Delta x_a| \ge \beta_{1,a}\right) \cdot \mathbb{1}\left(|\nabla x_a| \ge \beta_{1,a}\right), & 0 \le x_a \le l_a \\ \mathbb{1}\left(\frac{|\Delta x_a|}{x_a} \ge \beta_{2,a}\right) \cdot \mathbb{1}\left(\frac{|\nabla x_a|}{x_a} \ge \beta_{2,a}\right), & l_a < x_a \le u_a \\ \mathbb{1}\left(\frac{|\Delta x_a|}{x_a} \ge \beta_{3,a}\right) \cdot \mathbb{1}\left(\frac{|\nabla x_a|}{x_a} \ge \beta_{3,a}\right), & u_a < x_a \end{cases} \tag{3}$$

The feature vector when using all the analytes in $\Omega$ is constructed as

$$y = [\underbrace{x_1 \;\; \Delta x_1 \;\; \nabla x_1}_{y_1} \;\; \underbrace{x_2 \;\; \Delta x_2 \;\; \nabla x_2}_{y_2} \;\; \ldots \;\; \underbrace{x_d \;\; \Delta x_d \;\; \nabla x_d}_{y_d}].$$

As before, when considering multiple analytes, we label the given sample as an error if there exists one analyte $a \in \Omega$ that has an erroneous measurement according to $h_a(y_a)$. Hence, an equation similar to (2) holds for the non-causal delta check.
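A corresponding sketch of the non-causal rule (3), reusing the hypothetical `h_single` and parameter tables from the causal sketch above. Since the same $x_a$ selects the concentration range in both factors, the product of the two single-direction indicators implements (3) directly:

```python
def h_single_noncausal(x_a, dx_a, ndx_a, analyte):
    """Non-causal rule of Eq. (3): both variation directions must violate
    the threshold of the concentration range that x_a falls into."""
    return h_single(x_a, dx_a, analyte) * h_single(x_a, ndx_a, analyte)
```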

4 Data Analysis

In this section, we present the sufficient statistics by constructing the discriminant functions of our delta check. Evaluation of the discriminant functions gives us scalar scores that tend to be larger in the error class. The ROC curves of the decision rules can be empirically estimated by varying a threshold over the scores and each time classifying those exceeding the threshold as errors. Here, we also discuss estimation of the AUC values and their confidence intervals, as well as comparing the performance of single-analyte delta checks under usage of different analytes, by means of a one-sided hypothesis test over the difference between their AUC values.

4.1 Sufficient Statistics

The idea in constructing the statistics is to relax the indicator functions in the decision rules that compare the (normalized) variation with the thresholds (see (1) or (3)). First, note that we can rewrite each of the decision rules in a single expression. For example, the single-analyte causal decision rule in (1) can be reformulated as

$$h_a(y_a) = \mathbb{1}(0 \le x_a \le l_a)\, \mathbb{1}\left(|\Delta x_a| - \beta_{1,a} \ge 0\right) + \mathbb{1}(l_a < x_a \le u_a)\, \mathbb{1}\left(\frac{|\Delta x_a|}{x_a} - \beta_{2,a} \ge 0\right) + \mathbb{1}(u_a < x_a)\, \mathbb{1}\left(\frac{|\Delta x_a|}{x_a} - \beta_{3,a} \ge 0\right). \tag{4}$$

By relaxing the second indicator function in each term, we get the following discriminant score (function):

$$s_a(y_a) = \mathbb{1}(0 \le x_a \le l_a)\left(|\Delta x_a| - \beta_{1,a}\right) + \mathbb{1}(l_a < x_a \le u_a)\left(\frac{|\Delta x_a|}{x_a} - \beta_{2,a}\right) + \mathbb{1}(u_a < x_a)\left(\frac{|\Delta x_a|}{x_a} - \beta_{3,a}\right). \tag{5}$$

Note that the resulting scalar tends to be larger in the positive direction when $y_a$ is actually an erroneous sample. Our decision is obtained by thresholding the score values at zero. The discriminant score for the causal multiple-analyte case is constructed by replacing the decision rules $h_i$ in equation (2) by the scalar transformations $s_i$:

$$s(y) = \max_{1 \le i \le d} \{s_i(y_i)\}. \tag{6}$$

Construction of the discriminant function is more complicated in the case of non-causal delta checks. Note that relaxing the indicator functions of (3) is troublesome: even if the sample is correct, and therefore the differences between the (normalized) variations and the thresholds in both directions tend to be negative, their multiplication results in a positive value. In order to resolve this issue, we reformulate the rule by replacing the multiplication of the indicators by their minimum value, as in (7). The relaxed score function can then be written as in (8). Finally, a maximization similar to equation (6) will suffice to get $s(y)$ for the multiple-analyte case.

$$h_a(y_a) = \begin{cases} \min\left\{\mathbb{1}\left(|\Delta x_a| \ge \beta_{1,a}\right),\, \mathbb{1}\left(|\nabla x_a| \ge \beta_{1,a}\right)\right\}, & 0 \le x_a \le l_a \\ \min\left\{\mathbb{1}\left(\frac{|\Delta x_a|}{x_a} \ge \beta_{2,a}\right),\, \mathbb{1}\left(\frac{|\nabla x_a|}{x_a} \ge \beta_{2,a}\right)\right\}, & l_a < x_a \le u_a \\ \min\left\{\mathbb{1}\left(\frac{|\Delta x_a|}{x_a} \ge \beta_{3,a}\right),\, \mathbb{1}\left(\frac{|\nabla x_a|}{x_a} \ge \beta_{3,a}\right)\right\}, & u_a < x_a \end{cases} \tag{7}$$

$$s_a(y_a) = \mathbb{1}(0 \le x_a \le l_a)\, \min\left\{|\Delta x_a| - \beta_{1,a},\, |\nabla x_a| - \beta_{1,a}\right\} + \mathbb{1}(l_a < x_a \le u_a)\, \min\left\{\frac{|\Delta x_a|}{x_a} - \beta_{2,a},\, \frac{|\nabla x_a|}{x_a} - \beta_{2,a}\right\} + \mathbb{1}(u_a < x_a)\, \min\left\{\frac{|\Delta x_a|}{x_a} - \beta_{3,a},\, \frac{|\nabla x_a|}{x_a} - \beta_{3,a}\right\}. \tag{8}$$
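The relaxed scores translate directly into code. The sketch below implements (5), (6) and (8) on top of the same hypothetical parameter tables introduced earlier; thresholding any of these scores at zero recovers the corresponding hard decision rule.

```python
def s_single(x_a, dx_a, analyte):
    """Causal discriminant score of Eq. (5); a score >= 0 means 'error'."""
    l_a, u_a = CUT_POINTS[analyte]
    b1, b2, b3 = BETAS[analyte]
    if x_a <= l_a:
        return abs(dx_a) - b1
    if x_a <= u_a:
        return abs(dx_a) / x_a - b2
    return abs(dx_a) / x_a - b3

def s_single_noncausal(x_a, dx_a, ndx_a, analyte):
    """Non-causal score of Eq. (8): the minimum over the two directions, so
    the score is positive only when both variations exceed the threshold."""
    return min(s_single(x_a, dx_a, analyte), s_single(x_a, ndx_a, analyte))

def s_multi(per_analyte_scores):
    """Multiple-analyte score of Eq. (6): maximum of the per-analyte scores."""
    return max(per_analyte_scores)
```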

[Figure 1: ROC curves, together with their confidence regions (90%, 95% and 99%), of the single- and multiple-analyte delta checks. Panels: (a) causal single-analyte checks for the individual analytes, (b) causal multiple-analyte check, (c) non-causal multiple-analyte check. Axes: FPR vs. TPR, with the mean ROC curve and the chance line y = x.]

4.2 Estimating the ROC Curves

After embedding the feature vectors into a scalar space, the ROC curves can be easily estimated empirically by varying the threshold over the transformed scalar values and computing the false positive rate ($FPR$) and true positive rate ($TPR$) in each case. In order to compute the $(1-\alpha)$ confidence region of a given operating point $(FPR, TPR)$, we consider the number of false positives ($FP = n \cdot FPR$) and true positives ($TP = n \cdot TPR$) as two independent binomial random variables with success probabilities $FPR$ and $TPR$, respectively, and $n$ as the number of trials. Then, the confidence intervals of $FPR$ and $TPR$ can be computed accordingly (Johnson et al., 2005). Let $I_{1,\alpha'}$ and $I_{2,\alpha'}$ be the $(1-\alpha')$ confidence intervals of $FPR$ and $TPR$, respectively, where $\alpha' = 1 - \sqrt{1-\alpha}$; then, because of the independence assumption, the rectangle $I_{1,\alpha'} \times I_{2,\alpha'}$ is the $(1-\alpha)$ confidence region of the pair $(FPR, TPR)$. We take the union of such rectangles obtained for all the operating points as the $(1-\alpha)$ confidence region of our empirical ROC.
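As an illustration of this procedure, here is a sketch using empirical operating points and exact (Clopper-Pearson) binomial intervals. The Clopper-Pearson construction is an assumption on our part (the report only cites Johnson et al. (2005) for binomial intervals), and the class-conditional sample sizes are used as the binomial trial counts.

```python
import numpy as np
from scipy.stats import beta

def binom_ci(successes, trials, alpha):
    """Exact (Clopper-Pearson) (1 - alpha) interval for a binomial proportion."""
    lo = beta.ppf(alpha / 2, successes, trials - successes + 1) if successes > 0 else 0.0
    hi = beta.ppf(1 - alpha / 2, successes + 1, trials - successes) if successes < trials else 1.0
    return lo, hi

def empirical_roc(scores, labels, alpha=0.05):
    """Empirical ROC points (labels: 1 = error, 0 = sound), each with a
    (1 - alpha) rectangular region using alpha' = 1 - sqrt(1 - alpha) per axis."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    n_pos, n_neg = int(labels.sum()), int((1 - labels).sum())
    a_prime = 1 - np.sqrt(1 - alpha)
    points = []
    for t in np.unique(scores):
        tp = int(((scores >= t) & (labels == 1)).sum())
        fp = int(((scores >= t) & (labels == 0)).sum())
        points.append(((fp / n_neg, tp / n_pos),       # operating point (FPR, TPR)
                       binom_ci(fp, n_neg, a_prime),   # FPR interval
                       binom_ci(tp, n_pos, a_prime)))  # TPR interval
    return points
```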

[Figure 2: AUC values, together with their confidence intervals (90%, 95% and 99%), for the single- and multiple-analyte delta checks in the (a) causal and (c) non-causal modes; panels (b) and (d) show the p-values of the pairwise hypothesis tests described in Section 4.4 for the causal and non-causal modes, respectively.]

4.3 Estimating the AUC

In the formulations of this section, we focus on the non-causal multiple-analyte case only, but the same can be done for all the other cases as well. The AUC value can be estimated by either computing the area under the empirical ROC curve, or using the survivor functions under the error and non-error classes, which are the same as the probabilities of detection and false alarm, respectively ($\forall c \in \mathbb{R}$):

$$\text{error:}\quad S_e(c) = \Pr\left(s(y) \ge c \mid y \text{ is an error}\right), \tag{9a}$$
$$\text{non-error:}\quad S_n(c) = \Pr\left(s(y) \ge c \mid y \text{ is not an error}\right). \tag{9b}$$

These functions can be approximated empirically or using kernel density estimation (KDE). We denote the given approximations by $\hat{S}_e$ and $\hat{S}_n$. Let us denote the vector of scalar scores by $s = [s_1, \ldots, s_n]$, where $s_i = s(y_i)$. Without loss of generality, suppose that the transformed values in $s$ are sorted such that the first $k$ scalars are errors and the rest are sound measurements. Then, the AUC and its variance can be approximated as below (DeLong et al., 1988):

$$\mathrm{AUC} \approx \frac{1}{k} \sum_{i=1}^{k} \hat{S}_n(s_i), \tag{10a}$$
$$\mathrm{Var}[\mathrm{AUC}] \approx \frac{1}{k}\, \mathrm{Var}\left\{\hat{S}_n(s_i) \mid i = 1, \ldots, k\right\} + \frac{1}{n-k}\, \mathrm{Var}\left\{\hat{S}_e(s_i) \mid i = k+1, \ldots, n\right\}. \tag{10b}$$

The confidence interval of the AUC can also be approximated by assuming that its logit transformation, i.e. $\log\left(\frac{\mathrm{AUC}}{1-\mathrm{AUC}}\right)$, is distributed normally with the variance calculated in (10b) (Pepe, 2003).
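A sketch of the empirical version of (9) and (10), together with the logit-based confidence interval. The survivor functions are estimated empirically here rather than with the Gaussian KDE used in the experiments, and the delta-method step that maps $\mathrm{Var}[\mathrm{AUC}]$ to the variance of the logit is an assumption consistent with Pepe (2003).

```python
import numpy as np
from scipy.stats import norm

def survivor(ref_scores):
    """Empirical survivor function: S(c) = fraction of ref_scores >= c."""
    ref = np.sort(np.asarray(ref_scores, dtype=float))
    return lambda c: 1.0 - np.searchsorted(ref, c, side="left") / len(ref)

def auc_and_var(err_scores, sound_scores):
    """AUC and its variance as in Eq. (10), with empirical survivor functions."""
    S_n, S_e = survivor(sound_scores), survivor(err_scores)
    v_err = np.array([S_n(s) for s in err_scores])    # placements of error scores
    v_snd = np.array([S_e(s) for s in sound_scores])  # placements of sound scores
    auc = v_err.mean()                                # Eq. (10a)
    var = v_err.var(ddof=1) / len(v_err) + v_snd.var(ddof=1) / len(v_snd)  # (10b)
    return auc, var

def auc_ci(auc, var, alpha=0.05):
    """(1 - alpha) CI assuming logit(AUC) is normal (delta-method variance)."""
    z = norm.ppf(1 - alpha / 2)
    logit = np.log(auc / (1 - auc))
    se_logit = np.sqrt(var) / (auc * (1 - auc))
    lo, hi = logit - z * se_logit, logit + z * se_logit
    return 1 / (1 + np.exp(-lo)), 1 / (1 + np.exp(-hi))
```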

4.4 Comparing the ROC Curves

For two given analytes $a, b \in \Omega$, denote the vectors of single-analyte sorted scores in the non-causal case by $s_a$ and $s_b$, respectively (the formulation for the causal case is exactly the same). In order to compare the performance of the delta checks using these analytes, we consider the difference between their AUC values, $\Delta\mathrm{AUC}_{a,b} = \mathrm{AUC}_a - \mathrm{AUC}_b$, where $\mathrm{AUC}_a$ and $\mathrm{AUC}_b$ are estimated based on $s_a$ and $s_b$, respectively. However, these two AUC values are not independent, as $s_a$ and $s_b$ are computed from the same data. Therefore, the variance of $\Delta\mathrm{AUC}_{a,b}$ is not additive. It can be shown that this variance can be approximated as below (DeLong et al., 1988):

$$\mathrm{Var}[\Delta\mathrm{AUC}_{a,b}] = \frac{1}{k}\, \mathrm{Var}\left\{\hat{S}_n(s_{a,i}) - \hat{S}_n(s_{b,i}) \mid i = 1, \ldots, k\right\} + \frac{1}{n-k}\, \mathrm{Var}\left\{\hat{S}_e(s_{a,i}) - \hat{S}_e(s_{b,i}) \mid i = k+1, \ldots, n\right\}. \tag{11}$$

Then, assuming that $\Delta\mathrm{AUC}_{a,b} \sim \mathcal{N}(\mu, \mathrm{Var}[\Delta\mathrm{AUC}_{a,b}])$, the following one-sided hypothesis test is performed:

$$H_0: \mu \le 0 \qquad\qquad H_1: \mu > 0$$

So $H_0$ is the hypothesis that the AUC obtained by using analyte $a$ is no better than that obtained by using analyte $b$. The AUC difference tends to be larger under $H_1$ than under $H_0$; therefore, it is natural to reject $H_0$ when $\Delta\mathrm{AUC}_{a,b}$ is large. We take the test statistic to be $T = \frac{\Delta\mathrm{AUC}_{a,b}}{\sqrt{\mathrm{Var}[\Delta\mathrm{AUC}_{a,b}]}}$ and reject $H_0$ when $T \ge c$, for some critical value $c$. The power function of this test is $1 - \Phi\left(c - \frac{\mu}{\sqrt{\mathrm{Var}[\Delta\mathrm{AUC}_{a,b}]}}\right)$, which gives the test size $1 - \Phi(c)$. We are interested in the p-value of the test, which can be shown to be $1 - \Phi\left(\frac{\Delta\mathrm{AUC}_{a,b}}{\sqrt{\mathrm{Var}[\Delta\mathrm{AUC}_{a,b}]}}\right)$. By definition, the smaller the p-value of a certain observed AUC difference, the stronger the evidence we have to reject $H_0$, i.e., the more likely it is that using analyte $a$ is not equal to or worse than using analyte $b$.
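Building on the `survivor` helper from the previous sketch, here is a sketch of the paired comparison; `compare_aucs` returns the estimated $\Delta\mathrm{AUC}_{a,b}$ and the one-sided p-value $1 - \Phi(T)$.

```python
def compare_aucs(scores_a, scores_b, labels):
    """Estimated AUC_a - AUC_b and the one-sided p-value of Section 4.4,
    using the paired variance of Eq. (11)."""
    labels = np.asarray(labels).astype(bool)
    sa = np.asarray(scores_a, dtype=float)
    sb = np.asarray(scores_b, dtype=float)
    S_n_a, S_e_a = survivor(sa[~labels]), survivor(sa[labels])
    S_n_b, S_e_b = survivor(sb[~labels]), survivor(sb[labels])
    # Paired placements of the same error / sound samples under each analyte.
    d_err = np.array([S_n_a(v) for v in sa[labels]]) \
          - np.array([S_n_b(v) for v in sb[labels]])
    d_snd = np.array([S_e_a(v) for v in sa[~labels]]) \
          - np.array([S_e_b(v) for v in sb[~labels]])
    d_auc = d_err.mean()                                   # = AUC_a - AUC_b
    var = d_err.var(ddof=1) / len(d_err) + d_snd.var(ddof=1) / len(d_snd)
    return d_auc, 1 - norm.cdf(d_auc / np.sqrt(var))       # p-value of T
```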

5 Experimental Results

5.1 Experiment Settings

The data that we evaluated consisted of laboratory test values obtained from hospital inpatients with renal failure, 18 years of age or older, seen at Oregon Health & Science University during a twelve-month period (October through September). The set of analytes, $\Omega$, that we considered in constructing the sample vectors are urea (BUN), creatinine, sodium (Na), potassium (K), chloride (Cl), total carbon dioxide (CO2), calcium (Ca), phosphorus (P) and albumin. Serial data from samples that were noted by the laboratory to be hemolyzed, and which could have significantly affected the measured values of certain analytes (i.e., potassium), were excluded from our evaluation. We queried 805 samples, consisting of a mixture of randomly selected samples and samples with low likelihood under a Gaussian mixture model (GMM), to get their labels from the clinical experts, with 64 (7.95%) of the samples showing an error. Among them we got 436 samples with at least one prior measurement (with 37 errors), and 254 samples with both prior and subsequent measurements (with 26 errors). Therefore, $n$ is different for our causal and non-causal training data sets.

The parameters $\beta_{1,a}$, $\beta_{2,a}$ and $\beta_{3,a}$ are determined based on the physiological variation in analyte $a$ within the individual, and on machine imprecision. The former is specified based on recent literature (Ricos et al., 2009), and the latter based on quality control data obtained from the instrument used to measure the analytes. The concentration ranges of the analytes are also provided by the clinicians. Notice that these parameters can also be learned, for example, by doing a cross-validation over the obtained labeled data. Furthermore, the survivor functions in (9) are computed using KDE based on Gaussian kernels with empirically fixed kernel widths.

5.2 Statistics

The ROC curves obtained for the single- and multiple-analyte decision rules are shown in Figure 1. Because of lack of space, and because the curves of the single-analyte delta checks were very similar in the causal and non-causal modes, we show only the former. From Figure 1(a), BUN had the worst performance, and potassium and calcium were among the best, in terms of their average ROC curve (the magenta curve). The other analytes were somewhere in between. The AUC values of these ROC curves, together with their confidence intervals, are shown in Figure 2(a,c). We can observe that the relative performance of the single analytes in terms of the AUC values is similar in both the causal and non-causal modes, with longer confidence intervals for the latter. This is because $n$ is smaller in the non-causal case. The resulting p-values of our pairwise hypothesis tests are also shown in Figure 2(b,d). For each case, a matrix is displayed whose $(a, b)$ entry ($1 \le a, b \le d$) represents the p-value of the hypothesis test over $\Delta\mathrm{AUC}_{a,b}$. Therefore, darkness of such an entry implies that our data provide strong evidence against the null hypothesis that analyte $a$ does not outperform analyte $b$. Observe that the darkest rows of each matrix are those associated with potassium and calcium. That is, in the comparison between these analytes and the rest, it is highly probable that they are not worse than the others. In contrast, the rows corresponding to BUN and creatinine are among the brightest ones. These observations are in accordance with our observations on the ROC curves and the AUC values.

Figure 1(b,c) illustrates the ROC curves of the multiple-analyte decision rules. They showed significantly better performance than the single-analyte checks. This was expected, as they use the information from all the analytes. Also observe that, not surprisingly, the non-causal mode outperforms the causal mode. Recall that in the former we use the variations in both directions, whereas in the latter we are restricted to a subset of this knowledge by focusing only on the prior measurement. The AUC values of these multiple-analyte checks are also shown in the last rows of Figure 2(a,c), which shows that their mean AUCs (the red dots) are larger than all the single-analyte AUCs.

6 Conclusion

In this paper, we proposed a novel concentration-dependent delta check algorithm and provided its sufficient statistics by constructing the discriminant functions based on the decision rules of our algorithm. Computing the scores with the discriminant functions enabled us to perform various statistical analyses, such as empirically estimating the ROC curves and the AUC values, together with their confidence regions. The performance of our proposed delta check under various single analytes was also compared in a pairwise manner based on the differences between their AUC values. In future work, we will incorporate correlations between the variations of the analytes when detecting lab errors, develop a soft probabilistic classifier for the multiple-analyte delta check (rather than a hard-max classifier), and also devise an active learning framework based on the proposed discriminant function to efficiently query samples from the clinical experts.

References

DeLong, E. R., DeLong, D. M., and Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44(3):837-845.

Hawkins, R. (2012). Managing the pre- and post-analytical phases of the total testing process. Annals of Laboratory Medicine, 32(1):5-16.

Johnson, N. L., Kemp, A. W., and Kotz, S. (2005). Univariate Discrete Distributions, volume 444. John Wiley & Sons.

Pepe, M. S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press.

Ricos, C., Alvarez, V., and Cava, F. (2009). Biologic variation and desirable specifications for QC.

Strathmann, F. G., Baird, G. S., and Hoffman, N. G. (2011). Simulations of delta check rule performance to detect specimen mislabeling using historical laboratory data. Clinica Chimica Acta, 412(21-22):1973-1977.

Witte, D. L., VanNess, S. A., Angstadt, D. S., and Pennell, B. J. (1997). Errors, mistakes, blunders, outliers, or unacceptable results: how many? Clinical Chemistry, 43(8):1352-1356.