Supplementary Materials for


advances.sciencemag.org/cgi/content/full/4/2/eaao1659/DC1

Neyman-Pearson classification algorithms and NP receiver operating characteristics

Xin Tong, Yang Feng, Jingyi Jessica Li

Published 2 February 2018, Sci. Adv. 4, eaao1659 (2018)
DOI: 10.1126/sciadv.aao1659

The PDF file includes:

Proof of Proposition 1
Conditional type II error bounds in NP-ROC bands
Empirical ROC curves versus NP-ROC bands in guiding users to choose classifiers to satisfy type I error control
Effects of majority voting on the type I and II errors of the ensemble classifier
table S1. Results of LR in Simulation S2.
table S2. Results of SVMs in Simulation S2.
table S3. Results of RFs in Simulation S2.
table S4. Results of NB in Simulation S2.
table S5. Results of LDA in Simulation S2.
table S6. Results of AdaBoost in Simulation S2.
table S7. Description of variables used in real data application 1.
table S8. The performance of the NP umbrella algorithm in real data application 2.
table S9. Input information of the nproc package (version 2.0.9).

Other Supplementary Material for this manuscript includes the following:
(available at advances.sciencemag.org/cgi/content/full/4/2/eaao1659/DC1)

R codes and data sets

Supplementary Materials

Proof of Proposition 1

Proof: In our context, let T_(k) be the k-th ordered classification score of a left-out class 0 sample (i.e., a class 0 sample not used to train the base algorithm). Suppose T_(k) is the chosen threshold on classification scores. Then the classifier is ϕ_k = I(T > T_(k)), where T is the classification score of an independent observation from class 0, and the population type I error of ϕ_k given T_(k) is P[T > T_(k) | T_(k)] = 1 − F(T_(k)), where F is the cumulative distribution function of the T_i's. Hence, the violation rate (the probability that the population type I error of ϕ_k exceeds α) is

P[1 − F(T_(k)) > α] = P[F(T_(k)) < 1 − α]
= P[T_(k) < F^(−1)(1 − α)]
= P[T_(1) < F^(−1)(1 − α), ..., T_(k) < F^(−1)(1 − α)]
= P[at least k of the T_i's are less than F^(−1)(1 − α)]
= Σ_{j=k}^{n} P[exactly j of the T_i's are less than F^(−1)(1 − α)]
= Σ_{j=k}^{n} (n choose j) P[T_i < F^(−1)(1 − α)]^j (1 − P[T_i < F^(−1)(1 − α)])^(n−j)
≤ Σ_{j=k}^{n} (n choose j) (1 − α)^j α^(n−j),    (S1)

where the last inequality holds because P[T_i < F^(−1)(1 − α)] ≤ 1 − α and the right-hand side, viewed as P[Binomial(n, p) ≥ k] with success probability p, is increasing in p; the inequality becomes an equality when F is continuous.

Remark: The proof does not make any assumptions on F. Regardless of the continuity of F, we define its inverse as F^(−1)(x) = inf{y : F(y) ≥ x}, which has the property: x ≤ F(y) if and only if F^(−1)(x) ≤ y, for any x ∈ [0, 1] and y in the domain of F.
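The bound in (S1) is a binomial tail probability and is straightforward to evaluate numerically. In the spirit of the NP umbrella algorithm, one can choose the smallest order k whose bound does not exceed the tolerance δ. The following minimal R sketch is illustrative only (it is not the R code released with the paper, and the helper name np_order is hypothetical); it assumes n left-out class 0 scores are available:

```r
# Evaluate the violation-rate bound in (S1) for every order k and pick the
# smallest k whose bound is at most delta.
np_order <- function(n, alpha = 0.05, delta = 0.05) {
  # bound(k) = sum_{j=k}^{n} choose(n, j) (1 - alpha)^j alpha^(n - j)
  #          = P[Binomial(n, 1 - alpha) >= k]
  bound <- sapply(1:n, function(k) pbinom(k - 1, n, 1 - alpha, lower.tail = FALSE))
  k_star <- which(bound <= delta)[1]   # NA if n is too small for any valid order
  list(k = k_star,
       violation_bound = if (is.na(k_star)) NA else bound[k_star])
}

np_order(100)  # e.g., n = 100 left-out class 0 scores, alpha = delta = 0.05
```

Because the bound decreases in k, a valid order exists only when the bound at k = n, namely (1 − α)^n, is at most δ; otherwise the sketch returns NA, indicating that the left-out class 0 sample is too small.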

Conditional type II error bounds in NP-ROC bands

Given training data S = S^0 ∪ S^1, where S^0 and S^1 are the class 0 and class 1 samples, respectively, we randomly split S^0 into S^0_1 and S^0_2 and split S^1 into S^1_1 and S^1_2. For simplicity, we let |S^0_1| = |S^0_2| = n and |S^1_1| = |S^1_2| = m, and express the two left-out samples as S^0_2 = {x_1, ..., x_n} and S^1_2 = {X_1, ..., X_m}. In the following discussion, we treat S^0_1, S^0_2, and S^1_1 as fixed (by conditioning on them) and only consider the m data points in S^1_2 as random variables. We train a base classification algorithm (e.g., SVM) on S^0_1 ∪ S^1_1 and denote the resulting classification scoring function by f. Because f is a function of S^0_1 ∪ S^1_1, in our discussion here f is a fixed function that maps the feature space to ℝ. After applying f to the left-out samples S^0_2 and S^1_2, we denote the resulting classification scores by t_i = f(x_i), i = 1, ..., n, and T_j = f(X_j), j = 1, ..., m, respectively.

Suppose that we decide to use the k-th ordered left-out class 0 score, t_(k), as the score threshold. We then construct an NP classifier ϕ_k (based on one (M = 1) random split) as

ϕ_k(X) = I(f(X) > t_(k)).

We then find the corresponding rank of t_(k) among the left-out class 1 scores T_1, ..., T_m. There are three scenarios.

Scenario 1. If T_(1) ≤ t_(k) ≤ T_(m), we define the lower bound rank r_L and the upper bound rank r_U as

r_L = max{r ∈ {1, ..., m} : T_(r) ≤ t_(k)},    (S2)
r_U = min{r ∈ {1, ..., m} : T_(r) ≥ t_(k)},    (S3)

and denote their corresponding classifiers as

ϕ_{r_L}(X) = I(f(X) > T_(r_L)),    (S4)
ϕ_{r_U}(X) = I(f(X) > T_(r_U)).    (S5)

We define the conditional (here the conditioning is on the training data S^0_1, S^0_2, and S^1_1) type II errors of ϕ_{r_L} and ϕ_{r_U} as

R_1^c(ϕ_{r_L}) ≡ P[f(X^1) ≤ T_(r_L) | T_(r_L)] = F_1(T_(r_L)),    (S6)
R_1^c(ϕ_{r_U}) ≡ P[f(X^1) ≤ T_(r_U) | T_(r_U)] = F_1(T_(r_U)),    (S7)

where X^1 represents a new class 1 observation, and F_1 is the cumulative distribution function of the classification scores of class 1 observations. Because T_(r_L) ≤ t_(k) ≤ T_(r_U), we have

R_1^c(ϕ_{r_L}) ≤ R_1^c(ϕ_k) ≤ R_1^c(ϕ_{r_U}),    (S8)

where R_1^c(·) stands for the conditional type II error, and R_1^c(ϕ_k) = P[f(X^1) ≤ t_(k)] = F_1(t_(k)). In (S8), the left inequality becomes tight when T_(r_L) = t_(k), and the right inequality becomes tight when T_(r_U) = t_(k).

Given a pre-specified tolerance level δ, denote by β_L(ϕ_k) and β_U(ϕ_k) the (1 − δ) high-probability lower and upper bounds of R_1^c(ϕ_k). Specifically, β_L(ϕ_k) and β_U(ϕ_k) are defined as

β_L(ϕ_k) ≡ sup{β ∈ [0, 1] : Σ_{j=r_L}^{m} (m choose j) β^j (1 − β)^(m−j) ≤ δ},    (S9)
β_U(ϕ_k) ≡ inf{β ∈ [0, 1] : Σ_{j=r_U}^{m} (m choose j) β^j (1 − β)^(m−j) ≥ 1 − δ}.    (S10)

The reason that (S9) and (S10) give valid (1 − δ) high-probability lower and upper bounds is as follows. A constant β serves as a valid (1 − δ) high-probability lower bound of R_1^c(ϕ_k) if

P[R_1^c(ϕ_k) < β] ≤ δ.    (S11)

Since by (S8), P[R_1^c(ϕ_k) < β] ≤ P[R_1^c(ϕ_{r_L}) < β], in order to have (S11) hold it is sufficient to have P[R_1^c(ϕ_{r_L}) < β] ≤ δ. By (S6),

P[R_1^c(ϕ_{r_L}) < β] = P[F_1(T_(r_L)) < β] = P[T_(r_L) < F_1^(−1)(β)] ≤ Σ_{j=r_L}^{m} (m choose j) β^j (1 − β)^(m−j),

where the last inequality follows from (S1) in the proof of Proposition 1. Hence, it suffices to have

Σ_{j=r_L}^{m} (m choose j) β^j (1 − β)^(m−j) ≤ δ

to make (S11) hold. Among all the β values that satisfy this sufficient condition, we choose the supremum as the (1 − δ) high-probability lower bound of R_1^c(ϕ_k), leading to (S9).

A constant β serves as a valid (1 − δ) high-probability upper bound of R_1^c(ϕ_k) if

P[R_1^c(ϕ_k) ≤ β] ≥ 1 − δ.    (S12)

Since by (S8), P[R_1^c(ϕ_k) ≤ β] ≥ P[R_1^c(ϕ_{r_U}) ≤ β], in order to have (S12) hold it is sufficient to have P[R_1^c(ϕ_{r_U}) ≤ β] ≥ 1 − δ.

By (S7),

P[R_1^c(ϕ_{r_U}) ≤ β] = P[F_1(T_(r_U)) ≤ β] = P[F_1(T_(r_U)) < β] = Σ_{j=r_U}^{m} (m choose j) β^j (1 − β)^(m−j),

where the last two equalities follow from (S1) in the proof of Proposition 1, under the assumption that F_1 is continuous, which holds for most classification algorithms. Hence, it suffices to have

Σ_{j=r_U}^{m} (m choose j) β^j (1 − β)^(m−j) ≥ 1 − δ

to make (S12) hold. Among all the β values that satisfy this sufficient condition, we choose the infimum as the (1 − δ) high-probability upper bound of R_1^c(ϕ_k), leading to (S10).

Scenario 2. If t_(k) > T_(m), we define the lower bound rank r_L the same as in (S2) and the (1 − δ) high-probability lower bound β_L(ϕ_k) the same as in (S9), and we set the (1 − δ) high-probability upper bound β_U(ϕ_k) = 1.

Scenario 3. If t_(k) < T_(1), we define the upper bound rank r_U the same as in (S3) and the (1 − δ) high-probability upper bound β_U(ϕ_k) the same as in (S10), and we set the (1 − δ) high-probability lower bound β_L(ϕ_k) = 0.

In all three scenarios, we have P[R_1^c(ϕ_k) < β_L(ϕ_k)] ≤ δ and P[R_1^c(ϕ_k) > β_U(ϕ_k)] ≤ δ, leading to

P[β_L(ϕ_k) ≤ R_1^c(ϕ_k) ≤ β_U(ϕ_k)] ≥ 1 − 2δ.
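For a given split, β_L and β_U in (S9) and (S10) can be computed by inverting binomial tail probabilities, because Σ_{j=r}^{m} (m choose j) β^j (1 − β)^(m−j) = P[Binomial(m, β) ≥ r] is continuous and increasing in β. The R sketch below is an illustration under these definitions, not the nproc implementation (the function name type2_bounds is hypothetical); it also covers Scenarios 2 and 3 by defaulting to β_L = 0 and β_U = 1 when the corresponding rank does not exist:

```r
# t:  left-out class 0 scores from one split; T1: left-out class 1 scores
# k:  chosen order for the threshold t_(k); delta: tolerance level
type2_bounds <- function(t, T1, k, delta = 0.05) {
  thr <- sort(t)[k]            # t_(k), the chosen score threshold
  T1  <- sort(T1)              # ordered left-out class 1 scores
  m   <- length(T1)
  # G(beta, r) = P[Binomial(m, beta) >= r], increasing and continuous in beta
  G <- function(beta, r) pbinom(r - 1, m, beta, lower.tail = FALSE)

  beta_L <- 0                  # Scenario 3 default
  beta_U <- 1                  # Scenario 2 default
  if (any(T1 <= thr)) {        # r_L exists (Scenarios 1 and 2), see (S2)
    r_L <- max(which(T1 <= thr))
    beta_L <- uniroot(function(b) G(b, r_L) - delta, c(0, 1))$root        # (S9)
  }
  if (any(T1 >= thr)) {        # r_U exists (Scenarios 1 and 3), see (S3)
    r_U <- min(which(T1 >= thr))
    beta_U <- uniroot(function(b) G(b, r_U) - (1 - delta), c(0, 1))$root  # (S10)
  }
  c(beta_L = beta_L, beta_U = beta_U)
}
```

With t and T1 taken from one random split and k chosen as in Proposition 1, the returned pair brackets the conditional type II error of ϕ_k with probability at least 1 − 2δ.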

Empirical ROC curves versus NP-ROC bands in guiding users to choose classifiers to satisfy type I error control

In practice, the popular ROC curves cannot provide proper guidance for comparing two classifiers whose type I errors are bounded from above by some α, because ROC curves are constructed based on empirical type I and type II errors, which are calculated from test data or by cross-validation on training data and do not display population type I error information. We refer to such ROC curves as empirical ROC curves in the main text to differentiate them from the oracle ROC curves. Concretely, given an empirical ROC curve of a classification method, it is unclear how users should decide which point on the curve corresponds to a classifier satisfying the type I error bound α. Through Simulation 1 and Fig. 2, we showed that users cannot simply pick the point whose empirical type I error (i.e., the horizontal coordinate on the ROC curve) is no greater than and closest to α, a seemingly intuitive but actually improper practice. We further illustrate this point in Simulation S1 and Fig. 3. Due to the lack of direct information on population type I errors, existing methods for constructing ROC confidence bands cannot serve this purpose either.

Simulation S1. We use the same setup as in Simulation 1: (X | Y = 0) ~ N(0, 1) and (X | Y = 1) ~ N(2, 1), with P(Y = 0) = P(Y = 1) = 0.5. We simulate 2D = 2000 data sets {(x_i^(m), y_i^(m))}_{i=1}^N, where m = 1, ..., 2D and N = 1000. The first D data sets are used as the training data, and the other D data sets are the test data. On the m-th training data set, we construct N classifiers, I(X > x_i^(m)), i = 1, ..., N, and evaluate their empirical type I and II errors on the m-th test data set (i.e., the (D + m)-th simulated data set), resulting in one ROC curve. We also use the m-th training data set to calculate one NP-ROC lower curve, i.e., the lower curve of an NP-ROC band. Fig. 3 illustrates the D = 1000 ROC curves and NP-ROC lower curves.

Suppose that users would like to find a classifier respecting a type I error bound α = 0.05 with tolerance level δ = 0.05 from an ROC curve. An intuitive choice is to pick the classifier at the intersection of the ROC curve and the vertical line at α. If there is no classifier right at the intersection, a reasonable idea is to pick the first classifier to the left of the intersection. For the D classifiers chosen in this way, we summarize their empirical type I errors (on the test data) and their population type I errors as histograms (Fig. 3, left panel). The results suggest that although none of the classifiers has an empirical type I error greater than α, approximately 30% of the classifiers have population type I errors greater than α, violating the users' desire to control the type I error under α with at least 0.95 probability. On the other hand, the NP-ROC lower curves provide a natural way for users to choose classifiers given α, because the horizontal coordinates of the NP-ROC curves are type I error upper bounds: users can simply pick the classifier with horizontal coordinate α. For the D chosen NP classifiers, we summarize their empirical type I errors on the test data and their population type I errors as histograms (Fig. 3, right panel). It is clear that the violation rate of the population type I error is under δ = 0.05.

Another use of the NP-ROC lower curve is that it provides a conservative point-wise estimate of the oracle ROC curve. For an NP classifier ϕ, the corresponding point on an NP-ROC lower curve is, with high probability, below and to the right of the corresponding point on the oracle ROC curve. To explain this phenomenon, suppose that the classifier corresponds to the point (α(ϕ), 1 − β_U(ϕ)) on an NP-ROC lower curve and to the point (R_0(ϕ), 1 − R_1(ϕ)) on the oracle ROC curve. Then, by the definition of α(ϕ) (Equation (3)) and β_U(ϕ) (Equation (S10)) as the high-probability upper bounds on R_0(ϕ) and R_1^c(ϕ) (the conditional type II error, conditioning on the training data), respectively, we know that α(ϕ) ≥ R_0(ϕ) and β_U(ϕ) ≥ R_1^c(ϕ) hold with high probability. As R_1(ϕ) = E[R_1^c(ϕ)], where the expectation is with respect to the joint distribution of the training data, we have β_U(ϕ) ≥ R_1(ϕ) with high probability. Hence, the point on the NP-ROC lower curve is below and to the right of the corresponding point on the oracle ROC curve with high probability. In other words, for an NP classifier, the coordinates on the NP-ROC lower curve provide a conservative estimate of the corresponding coordinates on the oracle ROC curve. This phenomenon is visualized in the top right panel of Fig. 3.
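As a concrete illustration of the selection rule examined in Simulation S1, the following R sketch (illustrative only; it is not the R code accompanying the paper, the helper name sim_one is hypothetical, and the number of repetitions is reduced here) picks, on each simulated ROC curve, the threshold whose empirical type I error is no greater than and closest to α, and then evaluates its population type I error exactly with pnorm:

```r
set.seed(1)
alpha <- 0.05
N <- 1000

sim_one <- function() {
  # class 0 ~ N(0, 1), class 1 ~ N(2, 1), P(Y = 0) = P(Y = 1) = 0.5
  y_tr <- rbinom(N, 1, 0.5); x_tr <- rnorm(N, mean = 2 * y_tr)
  y_te <- rbinom(N, 1, 0.5); x_te <- rnorm(N, mean = 2 * y_te)
  thresholds <- sort(x_tr)                           # classifiers I(X > threshold)
  emp_t1 <- sapply(thresholds, function(thr) mean(x_te[y_te == 0] > thr))
  ok <- which(emp_t1 <= alpha)                       # empirical type I error <= alpha
  thr_star <- thresholds[ok[which.max(emp_t1[ok])]]  # closest to alpha from below
  1 - pnorm(thr_star)                                # population type I error P[N(0,1) > thr_star]
}

pop_t1 <- replicate(200, sim_one())
mean(pop_t1 > alpha)  # fraction of chosen classifiers whose population type I error exceeds alpha
```

The final fraction corresponds to the violation rate discussed above, which this selection rule does not keep under δ.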

Effects of majority voting on the type I and II errors of the ensemble classifier

In the following simulation, we demonstrate that the ensemble classifiers built from multiple random splits maintain the type I error violation rates under δ and achieve reduced average type II errors with smaller standard errors over multiple simulations, as compared with the NP classifiers built from just one random split.

Simulation S2. Our first generative model is a logistic regression (LR) model with d = 3 independent features and n = 1000 observations. For each feature, values are independently drawn from a standard Normal distribution. The three features have non-zero coefficients 3, 2.4, and 0.8, respectively. Denoting the feature matrix by X = [x_1, ..., x_n]^T ∈ ℝ^(n×d) and the coefficient vector by β ∈ ℝ^d, we simulate the response vector as Y = (Y_1, ..., Y_n)^T, with Y_i independently drawn from Bernoulli(1 / (1 + exp(−x_i^T β))).

Our second generative model is a linear discriminant analysis (LDA) model with d = 3 independent features and n = 1000 observations. We first simulate the response vector as Y = (Y_1, ..., Y_n)^T, with Y_i independently drawn from Bernoulli(0.5). Then we simulate the feature matrix X = [x_1, ..., x_n]^T ∈ ℝ^(n×d) as (x_i | Y_i = 0) ~ N((0, 0, 0)^T, I_3) and (x_i | Y_i = 1) ~ N((2, 0.6, 0.2)^T, I_3).

We simulate 1000 datasets from each of the above two models, obtaining 2000 datasets. With α = 0.05, δ = 0.05, and a varying number of splits M ∈ {1, 5, 9, 11, 15}, we construct five NP classifiers (one for each value of M) for each of six methods (logistic regression, support vector machines, random forests, naïve Bayes, linear discriminant analysis, and AdaBoost) based on each data set. We separately simulate two large test data sets with 10^6 observations from the two models, so that we can use the test data sets to calculate the test (empirical) type I and type II errors, which approximate the population type I and type II errors. For each combination of M value, classification method, and generative model, we use the corresponding NP classifiers to calculate the average of the approximate type I errors and its standard error, the percentage of classifiers with approximate type I errors greater than α (the type I error violation rate), and the average of the approximate type II errors and its standard error. The results, summarized in tables S1 to S6, show that the type I error violation rates of the ensemble classifiers stay below δ = 5%, and that the average type I and type II errors, which approximate the averages of the population type I and type II errors of the ensemble classifiers, as well as their standard deviations, decrease from M = 1 to M > 1. This simulation experiment and its results demonstrate the validity and effectiveness of the ensemble approach by majority voting.
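To make the two generative models and the ensemble step concrete, here is a minimal R sketch (illustrative only; it is not the paper's simulation code, the function names are hypothetical, and the parameter values follow the reconstruction in the text above):

```r
# Logistic regression generative model: independent N(0, 1) features, logistic link
simulate_lr <- function(n = 1000, beta = c(3, 2.4, 0.8)) {
  X <- matrix(rnorm(n * length(beta)), nrow = n)
  p <- 1 / (1 + exp(-X %*% beta))
  list(x = X, y = rbinom(n, 1, p))
}

# LDA generative model: Y ~ Bernoulli(0.5), class 0 mean 0, class 1 mean mu1, identity covariance
simulate_lda <- function(n = 1000, mu1 = c(2, 0.6, 0.2)) {
  y <- rbinom(n, 1, 0.5)
  X <- matrix(rnorm(n * length(mu1)), nrow = n) + outer(y, mu1)
  list(x = X, y = y)
}

# Majority vote across M individual NP classifiers: 'votes' is an
# n_test x M matrix of 0/1 predictions; ties cannot occur when M is odd.
majority_vote <- function(votes) as.integer(rowMeans(votes) > 0.5)
```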

table S1. Results of LR in Simulation S2. The first column indicates the data generative model (LR: logistic regression; LDA: linear discriminant analysis). The second column indicates the number of random splits on the class 0 training data to construct each ensemble classifier. The third column indicates the (approximate) averages (avg) and standard deviations (sd) of the population type I errors of the ensemble classifiers. The fourth column indicates the proportions of the (approximate) population type I errors exceeding α, i.e., the violation rates. The fifth column indicates the (approximate) averages (avg) and standard deviations (sd) of the population type II errors of the ensemble classifiers.

Model M Type I error avg (sd) Type I error violation rate Type II error avg (sd)
2.78% (.%) 2.3% 33.42% (5.9%) % (.8%).% 33.38% (4.2%) LR 9 2.7% (.75%).5% 33.35% (3.89%) 2.7% (.74%).5% 33.37% (3.86%) 5 2.7% (.73%).5% 33.35% (3.8%) 2.82% (.6%) 3.2% 9.22% (4.62%) % (.82%).7% 9.4% (3.63%) LDA % (.78%).4% 9.3% (3.39%) 2.72% (.77%).4% 9.9% (3.33%) % (.75%).4% 9.% (3.27%)

table S2. Results of SVMs in Simulation S2. The column meanings are the same as those of table S1.

Model M Type I error avg (sd) Type I error violation rate Type II error avg (sd)
2.8% (.4%) 3.% 43.22% (9.4%) 5 2.7% (.77%).9% 42.85% (7.53%) LR % (.73%).5% 42.76% (7.25%) 2.68% (.72%).7% 42.8% (7.2%) % (.7%).5% 42.74% (7.5%) 2.8% (.8%) 3.4% 34.% (5.42%) % (.78%).9% 33.25% (2.95%) LDA % (.76%).6% 33.5% (2.75%) 2.72% (.75%).6% 33.% (2.73%) 5 2.7% (.74%).5% 33.5% (2.58%)

table S3. Results of RFs in Simulation S2. The column meanings are the same as those of table S1.

Model M Type I error avg (sd) Type I error violation rate Type II error avg (sd)
2.79% (.3%) 2.7% 45.8% (7.52%) % (.72%).2% 45.5% (5.52%) LR % (.66%).% 44.97% (5.%) 2.3% (.65%).% 44.96% (5.4%) % (.64%).% 44.93% (5.%) 2.8% (.8%) 3.3% 26.% (7.6%) 5 2.6% (.75%).2% 25.8% (4.92%) LDA % (.73%).4% 24.84% (4.6%) 2.57% (.72%).4% 24.8% (4.52%) % (.7%).3% 24.77% (4.42%)

table S4. Results of NB in Simulation S2. The column meanings are the same as those of table S1.

Model M Type I error avg (sd) Type I error violation rate Type II error avg (sd)
2.79% (.%) 2.% 35.3% (5.32%) % (.76%).5% 35.8% (4.%) LR % (.72%).4% 35.3% (3.87%) 2.67% (.7%).3% 35.5% (3.84%) % (.7%).2% 35.% (3.75%) 2.8% (.5%) 3.5% 9.2% (4.58%) % (.82%).8% 9.% (3.62%) LDA % (.77%).4% 9.6% (3.35%) 2.72% (.75%).4% 9.3% (3.28%) 5 2.7% (.75%).4% 9.5% (3.24%)

table S5. Results of LDA in Simulation S2. The column meanings are the same as those of table S1.

Model M Type I error avg (sd) Type I error violation rate Type II error avg (sd)
2.78% (.2%) 3.% 33.48% (5.9%) % (.8%).4% 33.43% (4.4%) LR 9 2.7% (.75%).8% 33.42% (3.9%) 2.7% (.74%).5% 33.44% (3.85%) 5 2.7% (.73%).3% 33.43% (3.82%) 2.8% (.6%) 3.6% 9.5% (4.59%) % (.82%).7% 9.% (3.64%) LDA % (.77%).4% 9.5% (3.36%) 2.72% (.76%).3% 9.4% (3.29%) 5 2.7% (.75%).4% 9.5% (3.25%)

table S6. Results of AdaBoost in Simulation S2. The column meanings are the same as those of table S1.

Model M Type I error avg (sd) Type I error violation rate Type II error avg (sd)
2.8% (.2%) 2.9% 4.5% (6.2%) % (.7%).% 4.87% (4.5%) LR % (.65%).% 4.74% (4.9%) 2.3% (.64%).% 4.77% (4.4%) % (.62%).% 4.72% (4.6%) 2.8% (.9%) 3.6% 24.39% (6.2%) % (.74%).8% 23.39% (4.2%) LDA 9 2.5% (.7%).5% 23.2% (3.93%) 2.5% (.7%).3% 23.7% (3.86%) % (.68%).2% 23.% (3.78%)

table S7. Description of variables used in real data application 1. The data and description are from the Early Warning Project.

Variable | Description | Values

Response
mkl.start.1 | Onset of state-led mass killing episode in next year (t + 1) | {0, 1}

Predictors
reg.afr | US Dept State region: Sub-Saharan Africa | {0, 1}
reg.eap | US Dept State region: East Asia and Pacific | {0, 1}
reg.eur | US Dept State region: Europe and Eurasia | {0, 1}
reg.mna | US Dept State region: Middle East and North Africa | {0, 1}
reg.sca | US Dept State region: South and Central Asia | {0, 1}
reg.amr | US Dept State region: Americas | {0, 1}
mkl.ongoing | Any ongoing episodes of state-led mass killing | {0, 1}
mkl.ever | Any state-led mass killing since WWII (cumulative) | {0, 1}
countryage.ln | Country age, logged | [ , ]
wdi.popsize.ln | Population size, logged | [4.7889, ]
imr.normed.ln | Infant mortality rate relative to annual global median, logged | [ , ]
gdppcgrow.sr | Annual % change in GDP per capita, meld of IMF and WDI, square root | [ , ]
wdi.trade.ln | Trade openness, logged | [ , ]
ios.iccpr | ICCPR 1st Optional Protocol signatory | {0, 1}
postcw | Post-Cold War period (year ≥ 1991) | {0, 1}
pol.cat.fl1 | Autocracy (Fearon and Laitin) | {0, 1}
pol.cat.fl2 | Anocracy (Fearon and Laitin) | {0, 1}
pol.cat.fl3 | Democracy (Fearon and Laitin) | {0, 1}
pol.cat.fl7 | Other (Fearon and Laitin) | {0, 1}
pol.durable.ln | Regime duration, logged (Polity) | [ , ]
dis.l4pop.ln | Percent of population subjected to state-led discrimination, logged | [ , ]
elf.ethnicc1 | Ethnic fractionalization: low | {0, 1}
elf.ethnicc2 | Ethnic fractionalization: medium | {0, 1}
elf.ethnicc3 | Ethnic fractionalization: high | {0, 1}
elf.ethnicc9 | Ethnic fractionalization: missing | {0, 1}
elc.eleth1 | Salient elite ethnicity: majority rule | {0, 1}
elc.eleth2 | Salient elite ethnicity: minority rule | {0, 1}
elc.eliti | Ruling elites espouse an exclusionary ideology | {0, 1}
cou.tries5d | Any coup attempts in past 5 years ((t − 4) to (t)) | {0, 1}
pit.sftpuhvl2..ln | Sum of max annual magnitudes of PITF instability other than genocide from past 10 yrs ((t − 9) to (t)), logged | [ , 4.586]
mev.regac.ln | Scalar measure of armed conflict in geographic region, logged | [ , ]
mev.civtot.ln | Scale of violent civil conflict, logged | [ , ]

table S8. The performance of the NP umbrella algorithm in real data application 2. Given α = 0.1 and δ = 0.1, after randomly splitting the data into training data (3/4 of the observations) and test data (1/4 of the observations) 100 times, we calculate the empirical type I and type II errors on the test data. The violation rates of empirical type I errors (the percentage of empirical type I errors greater than α) of different approaches are summarized in the 2nd column. The averages and standard errors of empirical type II errors in percentages are summarized in the 3rd column. Three classification methods are considered here: penalized logistic regression (penlr), random forests (RF), and naïve Bayes (NB). For each method, we considered three ways to choose the cutoff: default, the classical approach that minimizes the overall classification error; naïve, the naïve approach that chooses the threshold whose empirical type I error on the training data is no greater than and closest to α; and NP, the NP umbrella approach.

Method | Type I error avg (se) | Type I error violation rate | Type II error avg (se)
penlr (default) | 9.35% (.4%) | 42% | 3.75% (.7%)
penlr (naïve) | 2.53% (.22%) | 94% | .25% (.4%)
penlr (NP) | 4.48% (.%) | 7% | 6.47% (.9%)
RF (default) | 2.45% (.6%) | 66% | 5.72% (.8%)
RF (naïve) | 53.5% (.24%) | % | .34% (.2%)
RF (NP) | 5.9% (.2%) | % | 2.88% (.2%)
NB (default) | .6% (.4%) | 54% | 4.43% (.%)
NB (naïve) | .9% (.7%) | 57% | 8.32% (.62%)
NB (NP) | .87% (.%) | 4% | 76.27% (.6%)

table S9. Input information of the nproc package (version 2.0.9).

Argument | Description
x | an n × d design matrix with n observations and d covariates
y | an n-dimensional vector with binary responses
method | the base classification method. The following options are provided in nproc package version 2.0.9:
  logistic: logistic regression
  penlog: penalized logistic regression with LASSO penalty, depending on the glmnet package version 2.0.5
  svm: support vector machines, depending on the e1071 package version 1.6.7. Arguments of the svm function can be passed from the npc and nproc functions
  randomforest: random forests, depending on the randomForest package. Arguments of the randomforest function can be passed from the npc and nproc functions
  lda: linear discriminant analysis
  nb: naïve Bayes, depending on the e1071 package version 1.6.7
  ada: AdaBoost, depending on the ada package version 2.0-5
alpha | type I error upper bound, with default value 0.05
delta | the violation rate of type I error, with default value 0.05
split | the number of splits M, with default value 1
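For reference, a usage sketch based only on the arguments listed in table S9; the data x and y below are simulated placeholders, and exact function signatures may differ across versions of the nproc package:

```r
# install.packages("nproc")   # CRAN package described in table S9
library(nproc)

set.seed(1)
x <- matrix(rnorm(1000 * 3), ncol = 3)                       # placeholder design matrix
y <- rbinom(1000, 1, 1 / (1 + exp(-x %*% c(3, 2.4, 0.8))))   # placeholder binary response

# NP classifier with type I error bound alpha and tolerance delta (defaults in table S9)
fit <- npc(x = x, y = y, method = "logistic", alpha = 0.05, delta = 0.05, split = 1)

# NP-ROC band for the same base method
band <- nproc(x = x, y = y, method = "logistic", delta = 0.05)
```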
