Estimating False Negatives for Classification Problems with Cluster Structure

Size: px

Start display at page:

Download "Estimating False Negatives for Classification Problems with Cluster Structure"

Shannon Parrish
6 years ago
Views:

1 Estimating False Negatives for Classification Problems with Cluster Structure György J. Simon Vipin Kumar Zhi-Li Zhang Department of Computer Science, University of Minnesota Abstract Estimating the number of false negatives for a classifier when the true outcome of the classification is ascertained only for a limited number of instances is an important problem, with a wide range of applications from epidemiology to computer/network security. The frequently applied method is random sampling. However, when the target (positive) class of the classification is rare, which is often the case with network intrusions and diseases, this simple method results in excessive sampling. In this paper, we propose an approach that exploits the inherent cluster structure of the data to significantly reduce the amount of sampling needed while guaranteeing an estimation accuracy set forth by the user. The basic idea is to cluster the data and divide the clusters into a set of strata, such that that proportion of positive instances in the stratum is very low, very high or in between. By taking advantage of the different characteristics of the strata, more efficient estimation strategies can be applied, thereby significantly reducing the amount of required sampling. We also develop a computationally efficient clustering algorithm referred to as class-focused partitioning it uses the (imperfect) labels predicted by the classifier as additional guidance in evaluating the purity of the clusters during the clustering process and also in forming the aforementioned strata. We provide a formal mathematical analysis of our estimation methodology and show that it can provide performance far superior to random sampling, when there is inherent clustering structure, and that it deteriorates into simple random sampling when the data has no relevant cluster structure. We evaluated our method on the KDDCup network intrusion data set. Our method achieved better precision and accuracy with a 5% sample than the best trial of simple random sampling with 40% samples. Keywords: false negatives estimation, random sampling, clustering 1 Introduction Determining the performance of a classifier is an important and well-studied problem. While a wealth of metrics (such as precision, recall, F-measure [46]) and techniques (e.g. bootstrapping, cross-validation, ROC curves[46, 38, 19]) have been developed for evaluating the performance of a classifier under a variety of conditions, most of them require that the true class labels are known. In many classification problems, however, establishing the true class label of all instances is not feasible. In epidemiology, very reliable classifiers, gold standard 1

2 tests, exist, but they are so intrusive that their application to the generic public without sufficient indication of the presence of the disease is deemed unethical and expensive. In case of network intrusion detection, a security expert (say, of an enterprise network) serves as a reliable classifier, who can distinguish malicious traffic behavior (positive class) from normal traffic behavior (negative class) at high confidence. However, due to the high traffic volume, it is unrealistic to expect a security expert to inspect every single traffic flow in his/her network. Instead of the gold standard test, a less expensive but imperfect classifier, such as an epidemiological screening test or a network intrusion detection system (IDS), is often employed. Positive outcome of this imperfect classifier is considered sufficient indication and the true status of these predicted positive instances are investigated. The goal of this paper is to develop a method for evaluating the performance of these imperfect classifiers. Part of the performance is already known: the true class label for all predicted positive instances is already determined. To obtain a full evaluation, however, one must also determine the number of false negative instances. A frequently used method to estimate the number of false negatives is to draw a sufficiently large sample of the instances classified as negative, determine the true class label by applying the gold standard test and compute the proportion of the positive instances in it [32]. Since in many applications, the positive class is typically rare (e.g., network intrusions and diseased outcomes in a medical screening test), and a (hopefully) large portion of them has been identified by the classifiers, only very few positive instances are left among the (negative) majority of the instances with the true class label not yet determined. Under these conditions, the sample size required to accurately estimate the number of false negatives grows excessively large. We will show later that detecting a certain class of common intrusions within a modest 20% range of error at least at 95% probability in the KDD Cup data set [1] which contains 480,000 instances, would require a sample size of over 100,000 instances. In this paper, we present a novel approach based on stratified sampling that allows for estimating the number of false negative instances more efficiently, while guaranteeing that the error stays within bounds specified by the user. The basic idea is to exploit the inherent cluster structure of the classification problem. We cluster and partition the data into distinct sets ( strata ) of clusters. Specifically, four strata are formed: (1) Negative stratum that that is hypothesized to contain clusters with a proportion of false negative (FN) instance close to (or equal to) 0, (2) Positive stratum with clusters that are hypothesized to have a FN proportion of close to (or equal to) 1, (3) Mixed stratum, with clusters that are neither pure positive nor negative and (4) Tiny stratum containing all clusters, which are whether they appear pure or not too small to yield statistically significant results alone. These strata have different statistical properties which can be exploited to develop a more efficient sampling scheme. The efficiency of our approach stems from two factors. First, in case of simple random sampling (SRS), purer clusters have smaller variability 2

3 in the proportion of FN instances across possible samples, which translates to smaller sample sizes. Second, when the Positive and Negative clusters are indeed close to pure, a more powerful (in the statistical sense) scheme the geometric model can be employed to further reduce the sample size. We provide formal mathematical analysis of our methodology and show that it provides far superior performance over simple random sampling (SRS), when the data exhibits inherent clustering structure that can be exploited. In the worst case, that is when there is no inherent cluster structure in data or when the positive class is not rare, our scheme deteriorates into SRS. As will be shown later, although our proposed estimation scheme works with K-Means clustering, K-Means clustering has a tendency to discover clusters of similar sizes, while our problem requires that small pure clusters (which we call partitions) are discovered. To address this problem, we propose a class-focused partitioning scheme that (i) is more efficient that K-Means in creating many small partitions and (ii) has the ability to incorporate the labels provided by the imperfect classifiers into the clustering process. We evaluate our proposed method on a real-world data set, the KDD Cup intrusion detection dataset. Our proposed method attains a 4-fold improvement in precision and accuracy over SRS with a sample size of 200,000 instances. Our method required a sample of only 25,000 instances. Our Contributions. The contributions of this paper is briefly summarized below. We propose a novel and efficient approach for estimating the number of false negative instances based on stratified sampling that exploits the inherent cluster structure in data to significantly reduce the number of samples needed while providing guarantees about the estimation error. This method exploits the cluster structure that underlies the data set. We develop a class-focused partitioning method that facilitates the reliable extraction of the clusters (or subclusters). The class-focused partitioning method makes use of both the predicted labels by the imperfect classifiers and the true labels for the predicted positive instances. This information is pushed into the partitioning algorithm in a manner such that possibly incorrect labels are not misleading the algorithm. 2 Related Works Methods of estimation. In empidemiology, the evaluation of new screening tests is a very important problem [38]. This explains why the majority of the related works comes from this domain. Accordingly, in the followings, we will use the terms (medical) test and classifier interchangably. In the medical domain, the goal is to establish the performance of a test based on the performance of another imperfect test. An early approach to assessing the performance of a new screening test with the use of existing tests was to 3

4 view each test as an expert and apply non-parametric methods to assess expert agreement. A number of expert agreement measures have been applied, most notably the Kappa-index [26, 51] and the McNemar test [43]. These methods however establish only the relative performance of a test, they do not assess the absolute performance. The first work to estimate the number of false negatives (and hence establish the absolute performance of a test) is the method by Goldberg and Wittes [21]. They apply the capture-recapture model [27, 32, 11]. The capture-recapture model in its original form makes the assumption that the classifiers to be evaluated are independent. This assumption is not rooted in any observation, it is a mere mathematical convenience [14]. The violation of this assumption affects the result dramatically [47]. Mane at al. showed that even moderate deviations from the independence assumption can invalidate the estimate [34]. The problem of conditional dependence has been battled on multiple fronts. Mane at al. proposed an algorithm that selects sets of attributes in a manner that ensures independence between the classifiers [35, 34]. When the classifier is designed by a third party, we do not always have the opportunity to select attributes to be used. Others have extended the capture-recapture method to use relaxed assumptions such as uniform dependence between the classifiers across the classes or uniform relative risk [11]. Latent class analysis is a popular method in epidemiological and sociological studies [42, 22]. Latent class analysis is the statistical analysis of unobservable variables. In this particular application, the unobservable variable is the true disease status. In latent class analysis for two tests, Hui et al. [28, 50] simultaneously estimate the type I and II error rates of both tests as well as the prevalence of the disease; 5 variables in total. From the application of two tests and verifying the predicted positives using the gold standard test, we gain 5 degrees of freedom. The problem is hence exactly identifiable [51], or conversely we have no degrees of freedom to estimate the correlation between the tests. For this reason, the problem of estimating the false negatives for two dependent classifiers is an under-defined problem[51]. The problem becomes tractable for three or more tests [51]. The latent class analysis approach has been extended to handle correlation between the classifiers. The dependence between classifiers have been modeled by additional latent classes [40], random effects [25, 41] or by using latent class loglinear modeling and assuming independence between pairs of tests [23, 53]. All these models require more than 2 tests. Latent class analysis, with the true disease status being the unobservable variable, especially in the case where the number of classes has been increased to account for dependence among the classifiers suffers from difficult interpretation. This shortcoming has been addressed by introducing loglinear models. Loglinear (or logistic) models for the two-test problem are also under-defined if interaction between the classifiers are to be taken into account. However they can be used to estimate relative performance [39, 13]. Similarly to the latent class modeling approach, in order to estimate the 4

5 absolute performance, additional information needs to be included. Alonzo et al. looked at two-phase studies, where the subjects are screened multiple times [3]. Lloyd at al. has performed the same tests in succession multiple times and counted the number of positive results. These results formed the basis for the model [30, 31]. Clustering, Semi-supervised clustering. Clustering is the problem of grouping objects into clusters such that objects within a cluster are more similar to each other than to objects in other clusters [46]. Arguably one of the most popular clustering methods is k-means [33]. K- means partitions the set of objects into k disjoint subsets under the constraint that the sum squared error is to be minimized. K-means is simple, fast (has a complexity of O(kn), with n being the number of objects and k denoting the desired number of clusters), but suffers from a number of drawbacks. It tends to discover spherical clusters of similar sizes and it has no mechanism to handle outliers. Also, k-means requires that the user specifies the number of clusters a priori. Furthermore, the clustering results are very susceptible to the initialization of the algorithm. Bisecting k-means [45] is an improvement over k-means providing a lesser degree of dependence on the initialization. Bisecting k-means initially considers the entire data set as a cluster and it iteratively bisects the least homogeneous cluster. Bisecting k-means in the end produces a hierarchy of clusters. DBSCAN [20], a density-based clustering algorithm takes a different approach. Essentially it finds regions with high density of objects. Some objects, called outliers, will not belong to any cluster. While DBSCAN can find arbitrarily shaped clusters and detect outliers, it has a higher computational cost. While spatial indices can reduce the complexity to O(n log(n)), in high dimensions, spatial indices fail, and the complexity becomes closer to O(n 2 ), where n denotes the number of objects to cluster. As far as user-specified parameters are concerned, DBSCAN automatically discovers the number of clusters, but the user still needs to specify the minimal density that qualifies as a cluster. Since the minimal cluster density needs to be specified, clustering using local density (i.e. find clusters whose density is higher than the density around them) is not possible. OPTICS[4] is an extension of DBSCAN for visualizing clusters based on density. Using OPTICS, a human user can detect varying cluster densities, but it is still insufficient for automatic density-based partitioning-type cluster detection. To our problem, where a large number of unlabeled instances and only a few labeled instances exist, semi-supervised learning [54, 44, 24] appears applicable. In particular, semi-supervised clustering [24] exhibits much resemblance. Semi-supervised clustering approaches make use of labeled instances in multiple ways. Labels (or user-defined constraints in general) can be used to automatically learn a suitable distance measure [29, 15, 52], objective function [18, 16] or they can be used to guide the assignment of objects to clusters [48]. 5

6 These techniques are not mutually exclusive, some recent methods provide a unified solution [9, 10]. Another way to use the labels is to perform dimensionality reduction before clustering [5]. Although many clustering algorithms (complete link [29], other agglomerative hierarchical techniques [17], COWEB [49]) have been extended to make use of labels, k-means (and its generic version, EM [9]) is the most popular method ([8, 48, 7, 37] and many others). K-means, with its dependence on the initial centroids gives rise to a novel use of the labels: one can find better centroids using labeled data [6]. Semi-supervised clustering methods in general are applied when the problem has clear structure, but this structure may be difficult to discover it without some assistance from the user. In our case, we do not need to discover the exact structure of the data. We use labels to verify the correctness of the classification algorithms that generated the labels. In fact, if we directly incorporated the labels into the clustering process in the form of adjustments to the distance function or the assignment of the instances to clusters, our argument would become circular. Moreover, the labels (or constraints) in semi-supervised clustering are assumed to be correct. In our case, the labels are imperfect. 3 Problem Formulation We consider a classification problem, where the instances of a population D are classified as having either positive (+) or negative ( ) class label. Our goal is to evaluate the performance of a classifier T, when the true labels are not known for all instances. An instance is predicted positive, if T predicts it to have positive label; otherwise it is predicted negative. We have an infallible expert determine the true label of the predicted positive instances. A predicted positive instance is true positive (TP) if its true label is positive, otherwise it is false positive (FP). Analogously, a predicted negative instance is a false negative(fn) (resp., true negative) if its true label is positive (resp., negative). Goal. Derive an estimate ˆt for the number of false negative instances given the predicted labels and the true labels for the predicted positive instances such that [ Pr ˆt t ˆt < ε ] = 1 α, where ε is the margin of error and α is the significance level. Both ε and α specified by the user. In the experiments, we will use α = 5% and ε = 20%. One way to estimate d is to use simple random sampling (SRS). In SRS, we take a number of samples from the predicted negative instances, have the infallible expert determine the true labels of the instances in the samples and compute the proportion of false negative (FN) instances in the samples. From this information, we can infer the number of FN instances in the population of predicted negative instances. 6

7 When the positive class is rare (i.e., the number of instances with + true labels is very small relative to the total size of the dataset), the required sample size becomes excessively large. Let us consider the concrete example of the KDD Cup intrusion detection data set. The positive class accounts only for 1% of all instances. While 85.99% of the positive instances can be identified by our classifier, the remaining 575 positive instances are hidden among the approximately 490,000 predicted negative instances. Using SRS, to obtain a good estimate of these 575 false negative instances within 20% error of margin at strictly 95% or higher probability requires a sample size of 200,000, which is about 40% of the total population 1! To develop a more efficient method for estimating the false negative instances that do not require excess sampling while attaining good estimation accuracy, in this paper, we propose an innovative approach that exploits the inherent cluster structure underlying the classification problem by partitioning the data and by dividing the partitions into four distinct strata each featuring certain statistical properties that make more efficient sampling strategies applicable. In the following section, we will show how such a partitioning can be attained, and how such a partitioning can lead to an efficient estimation method that significantly reduces the amount of sampling while attaining the user-specified accuracy bound. 4 Proposed Approach As discussed above, our goal is to divide the data into groups which are likely to be pure: either all of the predicted negative instances in the group are positive or negative. Intuitively, this can be attained, because the population adheres to a limited number of modes of behavior. Some example modes of behavior in the network data are web and SMTP ( ) traffic in a data communications network. Instances that share the same mode of behavior are assumed to be more similar to each other than instances that have a different mode of behavior. Our method aims to discover the above modes of behaviors and to exploit them to the benefit of the estimation process. To this end, we employ a classfocused partitioning technique that discovers partitions of the population where estimating the number of false negative instances is easy. Estimation is easy, when the proportion of false negative instances in a partition is close to 0 or close to 1. In other words, partitioning can be thought of as focusing the positive instances into a set of partitions and focusing the negative instances into a different set of partitions. Unfortunately, purity, which is measured by the true proportion of false negative instances in a partition, can not be directly observed. Instead, we can assume that a partition is pure, if all of the following conditions hold true: (a) the partition is tight, that is, it has a high intra-cluster similarity; (b) it 1 Obtained from simulation. If we knew the true proportion of FN instances, theoretically, 96,000 would be enough. 7

8 is observed pure, namely the imperfect labels indicate that all instances in a partition are either true positives or false positives (i.e. the partition does not contain both true and false positives). and (c) it is nearest-neighbor (NN) pure: its nearest neighbor is also observed pure and the two partitions share the same labels. The figure on the left illustrates the motivation for NN purity. The positive instances in the two negative partitions would be difficult to detect but their presence can be suspected from the vicinity of the large positive partition. Once these tight, pure partitions are discovered, the number of false negative instances can be determined by our modified stratified sampling scheme. In the followings, we describe the class-focused partitioning algorithm and our estimation scheme in detail. 4.1 Class-Focused Partitioning The class-focused partitioning technique aims to discover partitions that are pure. Purity cannot be observed, so we heuristically approximate it. We consider a partition pure if it is tight, observed-pure and large enough (i.e. not tiny). Formal definitions for these properties follow. Property 1 (Tightness) We call a partition tight, if for its mean squared error (MSE), MSE < minmse holds, where minmse is a user-defined threshold. MSE measures the average dissimilarity of the partition by comparing each instance to the partition mean. MSE does not penalize large partitions as long as their instances are sufficiently similar to each other. Property 2 (Tininess) We call a partition tiny if it is so small that even if it contains positive instances, the classifier is unlikely to discover them. Tininess is related to the recall of the classifiers, which in turn requires that the false negative rate is known. Consequently, a precise threshold is not possible to derive. Instead we use a heuristic to determine a global threshold s: Pr[predicted positive actual positive]s = a θn s < 1, where a denotes the true positives, N the total number of instances and θ the assumed 2 prevalence of the positive class. Property 3 (Purity) We call a partition pure, if most (or preferably all) instances share the same mode of behavior. From the information we have available, we can not measure purity. Instead, Observed purity is measured using the imperfect labels. A partition is observedpure if for any of the classifiers TPR > 0 and FPR = 0; or if TPR = 0 and FPR 0 3. Observed purity does not imply purity (e.i. non-observed purity), 2 θ does not need to be precise. Tiny partitions will be pooled together, so as long as s is small, it will have no impact. 3 The lack of true and false positive instances can be indicative of the lack of FN instances 8

9 however they are correlated. A partition P is called nearest-neighbor-pure (NN-pure) if P is tight, not tiny and observed-pure and the partition whose centroid is closest to the centroid of P is tight, observed-pure, not tiny and consists of instances that share the same label as the instances in P The Algorithm. The class-focused partitioning technique extends bisecting k-means. Initially, it considers the entire data set D as one cluster. Then it iteratively partitions it into two smaller partitions until the stopping condition evaluates to true. Once the stopping condition is met, the partition is set aside for estimation. Stopping Condition. The purpose of the class-focused partitioning is to create partitions where estimating the number of FNs is easier. As discussed before, estimation is simple, when the partition can be assumed to be almost pure, that is, it is tight, observed-pure, NN-pure and not tiny. Bisecting stops, when such a stage is achieved. Moreover, bisecting also stops, when further partitioning is not beneficial. Partitioning is not beneficial, when (a) the partition contains no instances with unknown labels, (b) the partition is already tiny or (c) bisecting the partition does not result in improvement in terms of either observed purity or tightness. For each partition, let n denote the total number of instances, u the number of instances with unknown class labels and s the threshold for tininess. Let mse denote the MSE of the partition and let minmse be a user-defined minimum MSE threshold. The partition is not bisected, if any of the following conditions holds. 1. If u = 0, then the class label of all instances is known. 2. If u < s, then the partition is tiny. 3. If mse < minmse, then the partition is tight. 4. If bisecting the partition results in no improvement. Let m p denote the MSE of the parent partition and m 1 and m 2 denote the MSE of the child partitions. Let t p denote the true positive rate (TPR) of the parent and let t 1 and t 2 denote the TPR of the two child partitions. There is no improvement in tightness if ( mp m ) ( p t m 1 m 2 t p t ) 2 1. t p There is no improvement if the density of all instances as well as the density of positive instances is uniform over the partition. In that case, by bisecting the partition, the average distances is expected to become half, and the MSE become 1/4. Hence the factor of 4. The true positive rate is expected to remain unchanged. We allow for a.5 fluctuation. 9

10 4.2 Outline of the Estimation. Once the partitions are created, we form strata on which our stratified-samplingbased estimation is carried out. We shall show that estimation using these strata is correct (i.e. the estimate satisfies the error bound defined in Section 3) and more efficient than just applying simple random sampling (SRS). Conceptually, we create 3 strata: (i) Positive that is relatively rich in FN instances, (ii) Negative, which contains relatively few FN instances and (iii) Mixed containing the rest of the partitions. The Negative stratum is assumed to contain no false negative instances, while the Positive stratum is assumed to contain no true negative instances. Then estimation proceeds as follows. Let ˆt denote the number of FN instances in a stratum. Since the Negative stratum is assumed to contain no false negatives, we estimate ˆt = 0. The purpose of sampling here is to prove that there are not too many 4 false negative instances in the stratum w.r.t. the predefined error bounds. We employ the geometric model (described later in Section 4.4.1) to this end. If we happen to find too many false negatives during the sampling, then we revert back to SRS in this stratum. With the Positive stratum, we proceed similarly. This stratum is assumed to contain only positive instances, hence ˆt = u, where u denotes the number of instances in this stratum with unknown true label. Here we employ the geometric model to prove that indeed there are not too many true negatives in this stratum. If too many true negatives are found, the number of FN instances in this stratum is estimated via SRS. Finally, for the Mixed stratum, we simply employ SRS. Our approach is correct in the sense that the estimate will be within the error bounds. This is guaranteed for SRS. We only deviate from SRS, when the stratum is observed to be pure. Should we discover that it is not sufficiently pure, we revert back to SRS. The worst-case scenario is when the data set is uninformative, i.e. it does not contain relevant clusters. Some partitions could appear (by chance) pure negative, others pure positive. During the application of the geometric model, we will discover (at 1 α probability as it will be clear in Section 4.4.1) that these partitions are not clear and return to SRS. The effectiveness of our approach stems from two factors. First, when the partitions are observed pure, then the geometric model is applicable 5. Second, even when the partitions are found to be insufficiently pure, the sample size required to obtain estimates using SRS decreases with increasing purity within a fixed error bound (See Equation 5). This is demonstrated in the following examples. Example 1 FN estimation with relative error in the population. Let us consider a population of 100,000 instances with unknown labels. The goal is to estimate the π proportion of FN instances within 20% error at 95% 4 i.e. the error is within the pre-define error at 1 α probability. In the rest of the paper, we will use too many in this sense. 5 SRS is not even applicable when the proportion is 0 or 1. The variance of the estimate becomes 0 and alongside, the sample size becomes 0, too. 10

11 probability. Figure 1 depicts the required sample size as function of π. The results are obtained from simulation: for each π {1/1000,..., 999/1000}, we tested sample sizes of s {1%, 2%,..., 99%}. For each π and for each s, we ran 100 trials and for each π we determined the smallest s such that at least 95 of 100 trials resulted in an estimate ˆπ of π within 20% error. Figure 1 depicts this minimal s for each π. Sample size required to estimate the FN proportion in the population Relative Error of 20\%, Population=100, Required Sample Size [Percent] True FN proportion in the population Figure 1: The minimal sample size required to estimate a proportion within a certain (relative) bound of error. Figure 1 shows that the sample size required to perform estimation within a certain among of relative error, i.e. when the error is a function of the estimate such as 20% of ˆπ, increases as the proportion of FN instances decreases. Example 2 FN estimation with absolute error in the stratum. Let us assume the the proportion π of FN instances in the population is fixed, and let ˆπ be a fixed estimate of π. Let the error ǫ be 20% of ˆπ ; ǫ is also fixed. Next, we create 3 strata. We distribute the FN instances among these 3 strata arbitrarily. We will concentrate on one of the 3 strata, which we denote by N. In this example, we will control the proportion π N of FN instances in N: π N = 1/1000, 2/1000,..., 999/1000. Let us remind the Reader, that the total number of FN instances is fixed, we just change their distribution over the 3 strata. We also need to distribute the error ǫ among the 3 strata. Arbitrarily, let us assign ǫ N =.3ǫ. Let us remind the Reader, that ǫ N is absolute error, in the sense, that it does not depend on π N, it only depends on the fixed ˆπ. Then Figure 2 depicts the sample size required to estimate π N within ǫ N error at 95% probability. These sample sizes were determined by simulation identically to Example 1. Figure 2 shows that the sample size decreases with increasing purity (i.e. the increasing.5 π N ) of the partition. This observation is contrary to the 11

12 Sample size required to estiamte the FN proportion in the stratum Absolute error of 30, Population of 100, Sample size [percent] FN proportion in the stratum Figure 2: The minimal sample size required obtain an estimate of a proportion within a bound of absolute error. observation in Figure 1. The reason lies in the use of relative error, where the error depends on the quantity to be estimated, versus the use of absolute error, where the error does not depend on the quantity to be estimated. In other words, our approach becomes increasingly efficient as the strata become purer. When the proportion of FN instances in all three strata are equal and coincide with the proportion of FN instances (i.e. the clustering is totally meaningless), we perform equivalently to SRS. Our gain increases as the Negative stratum becomes purer, that is our gain increases as the difference between the proportion of FN instances in the Negative stratum and the proportion of FN instances in the entire population increases. Our gain also increases as the Positive stratum becomes increasingly pure. Accordingly, our goal is to increase the size of the Negative stratum whilst minimizing the proportion of FN instances in it and to increase the size of the Positive stratum whilst maximizing the proportion of FN instances in it. This decreases the size of the Mixed stratum, where no gain is registered. Estimation process. 1. The strata are formed as described in Section An initial estimate of the FN proportion ˆπ (0) is computed from the number of True Positive instances and a small sample drawn from the Mixed strata. 3. A sampling plan is created. Given the total allowed error, the sampling plan determines how much of the total allowed error is allocated to each stratum such that the total required sample size is minimal. 4. The sampling plan is executed. The outcome of the execution is an updated (more precise) FN proportion estimate (ˆπ (1) ) and its achieved error. 12

13 If the achieved error is less than maximal allowed error, then the estimation is complete and the final estimate is ˆπ (1). Otherwise, the estimate needs to be refined. A new total allowed error is computed from ˆπ (1) and the process repeats from Step 3. In the subsequent sections, we elaborate on these steps. 4.3 Strata Formation. In this section, we focus on Step 1 of the Estimation Process. The input to the strata formation step is the set of partitions or clusters. First, all partitions/clusters that have no unknown instances are discarded they can not contain false negative instances. Next, four strata are formed. Conceptually, they are the same as above, but they lead to increased efficiency. Tiny Positive Negative Mixed All tiny partitions All tight, not tiny, observed-pure and NN-pure partitions with TPR > 0 and FPR = 0. All tight, not tiny, observed pure and NN-pure partitions with T P R = 0. All remaining instances. In case of the Negative category, there is strong indication that the partition is pure, when TPR=0 and FPR=0, there is no indication that the partition is indeed negative. It is possible, that the partition is pure positive, but our classifier failed to discover all of the positive instances in the partition. We take a small sample to ensure that the partition is not positive in its entirety. If it is, then it is moved to the Mixed stratum. 4.4 Estimating the Number of FN Instances. Once the strata are formed, stratified sampling is carried out to determine the number of false negative instances. Let T, P, N and M denote the Tiny, Positive, Negative and Mixed strata, respectively. For each stratum h, h {T, P, N, M}, let N h denote the number of instances with unknown label, ˆt h the estimate of the number of FN instances, ˆπ h the estimated proportion of false negatives among the N h instances with unknown label (ˆt h = N hˆπ h ), e h the estimation error in terms of number of instances and ǫ h the estimation error in terms of proportion of instances with unknown label (i.e. e h = N h ǫ h ). The final estimate for the number of FN instances ˆt and its error e can be computed as t = ˆt h, e = e h. (1) h {T,P,N,M} h {T,P,N,M} First, in Step 2 of the Estimation Process, an initial crude estimate ˆπ (0) of the proportion of FN instances is computed from the number of pure positive 13

14 instances, and by taking a small sample from the Tiny and Mixed strata (which results in ˆπ (0) M and ˆπ(0) T, respectively; ˆπ N is assumed 0). ˆπ (0) = N P h {T,P,N,M} N h + ˆπ (0) M + ˆπ(0) T. Next, the total allowed error is computed: ǫ 0 = εˆπ (0), where ε is defined by the user as described in Section 3. In Step 3, a sampling plan is created. The sampling plan is an optimal allocation of the error to the four strata. Formally, given ǫ 0, we need to determine ǫ T, ǫ P, ǫ M and ǫ N, such that min ˆn h s.t. ǫ 0 = ǫ h, ǫ P,ǫ M,ǫ N,ǫ T h h where ˆn h denotes the estimated size of required sample from stratum h. The allocation algorithm will be explained in Section In Step 4, the sampling plan is executed. This entails determining the sample size required to estimate ˆt h within the margin of error ǫ h, drawing the sample and determining the true labels. Then ˆt h is computed. For strata T and M, we will use stratified (simple) random sampling (SSRS) as we will describe in Section For stratum P, ˆt P = N P and a sample is drawn only to verify that the stratum is indeed pure positive. Similarly, for N, ˆt N = 0 and a sample is drawn to ascertain that the stratum is indeed pure negative. To this end, the geometric model is employed Geometric Model for P and N. Let us consider the N stratum; estimation in P works analogously. As we discussed before, in stratum N, all instances are assumed negative. The goal here is to probabilistically prove, that there are no false negative instances in the stratum. More precisely, we shall show that if we draw a sample of size z (as determined by Equation 3), and we find 0 FN instances in that sample, then the estimate ˆt N = 0 is correct within an error bound of ǫ N at α probability. N can contain 2 types of instances with unknown labels: (i) predicted negatives that are true negatives and (ii) false negatives that are actually positives. Let us assume that there are x false negatives instances and N N x true negative instances. We need to show, that if x N N ǫ N, then by drawing a sample of size z, there will be 0 false negative instances in the sample at most at probability of α. ( NN x z ) ( NNz ) ( NN N N ǫ N z ) ( NN x z ) ( NNz ) < α ( NNz ) (1 ǫ N ) z < α (2) 14

15 This result is related to the geometric distribution. Let Z be a random variable denoting the number of trials before we successfully find a FN instance under the above conditions 6. Then Z follows geometric distribution with p.d.f. Pr[Z = z; ǫ N ] = (1 ǫ N ) z. Then we need to prove Pr[Z = z; ǫ N ] < α, which results in z log α log(1 ǫ N ) An analogous estimation can be applied to the P stratum except we need to find true negatives instead of false negatives. Adjusting the Estimate. Assume that for the N stratum, x (x > 0) FN instances were discovered during the sampling. In such a case, we revert back to estimation using SSRS (Section 4.4.2). We can use ˆπ N = x/z as the proportion of FN instances. True negatives in P can be processed analogously. This is still advantageous because the sample size required by SSRS decreases with greater purity of the N and P partitions Stratified (Simple) Random Sampling for T and M. Let y hi denote the true label of instance i in stratum h, h {T, M} { 1, if instance i is false negative y hi =. 0, otherwise The population mean corresponds to the proportion of false negatives in stratum h N h ȳ h = π h = y ih /N h, where N h is the number of instances in the stratum. When we take a sample of n h instances, we can estimate the population mean from the sample mean i=1 n h ˆȳ h = ˆπ h = y ih /n h. The estimate for the number of false negative instances in the stratum is i=1 ˆ t h = N hˆπ h. The population variance S 2 in stratum h is estimated from the sample variance s 2, which is s 2 h = 1 n 1 n h i=1 (y ih ˆȳ h ) 2 = n h n h 1 ˆπ h(1 ˆπ h ). 6 It is reasonable to assume that the probabilities in the stratum are equal (namely 0 or very small) and that the instances are independent (3) 15

16 An unbiased estimator of the variance of yˆ h is Var[ˆȳ h ] = Var[ˆπ h ] = ( 1 n h N h ) s 2 h n h = ( 1 n N ) ˆπ h (1 ˆπ h ). (4) n h 1 The variance of yˆ h is related to the estimation error through the confidence intervals. The estimation error as a proportion is the half-length of the 1 α confidence interval ǫ h = z α/2 Var[ˆȳ h ]. (5) One can also express the estimation error in terms of number of instances e h = ǫ h N h. Estimation of the required sample size Next, let us derive the sample size required for estimating the number (or proportion) of false negative instance within an error proportion of ǫ 0 at a probability 1 α: [ ˆȳ ȳ ] Pr < ǫ 0 = 1 α. ȳ We know that the error (as a proportion) is Solving for n yields ǫ h = z α/2 ( 1 n h N h ) S n. n h = z 2 α/2 S2. (ǫ 0 ȳ) 2 + z2 α/2 S2 N h An estimate for the required sample size to be taken from stratum h is then ˆn h = z 2 α/2 s2 h (ǫ 0ˆȳ h ) 2 + z2 α/2 s2 h N h z 2 α/2 ˆπ h(1 ˆπ h ) (ǫ 0ˆπ h ) 2 + z2 α/2 ˆπ h(1 ˆπ h ) N h. (6) Optimal Allocation of Instances for Sampling. In this section, we consider the question of how many instances should be sampled in each stratum to achieve the required margin of error with the least number of instances sampled. This problem is equivalent to distributing the possible error between the strata. Let e h denote the number of errors allocated to stratum h. Also let ˆn h denote the estimated sample size for stratum h. The total number of instances to be sampled is given by ˆn + = ˆn T }{{} Eq. 6 + ˆn M + ˆn P }{{} Eq. 6 }{{} Eq. 3 + ˆn N }{{} Eq. 3 16

17 = + + z 2 α/2 ˆπ T (1 ˆπ T ) (ǫ T ) 2 + z2 α/2 ˆπT (1 ˆπT ) N T z 2 α/2 ˆπ M(1 ˆπ M ) (ǫ M ) 2 + z2 α/2 ˆπM(1 ˆπM) N M log α log(1 ǫ P ) + log α log(1 ǫ N ). The optimal allocation is defined by argmin ǫt,ǫ M,ǫ N,ǫ P ˆn +, s.t. h {T,P,N,M} ǫ h = ǫ 0. (7) We do not have a closed form solution to Eq. 7; we solve it numerically. ˆπ T, ˆπ M are determined from an initial small sample, ˆπ P is assumed 1 and ˆπ N is assumed 0 a small sample would not be useful to estimate ˆπ P and ˆπ N. Given the partitioning and the strata, this allocation is optimal if h, π h = ˆπ h. Since the estimated error of the proportion ˆπ h is known before the sample allocation is performed (Eq. 5), a more precise ˆπ h can be computed from a larger sample, should the error of ˆπ h be too large. 4.5 Example. Let us illustrate the estimation approach through an example. For simplicity, we use 3 strata. In our hypothetical problem, let us assume that there are 101,000 instances with unknown labels, out of which 500 are false negatives. 100,000 instances belong to the Negative stratum, 490 to the Positive and 510 to the Mixed. Out of the 510 Mixed instances, 10 are false negatives. SRS even if it knew the true proportion of FN instances before the estimation, which is a very strong assumption would still require a sample of 16,207 instances (Equation 6 with ˆπ = 500/101000, ǫ = 100/101000) to arrive at the correct estimate ˆd, s.t. ˆd 500 < 100 at 95% probability (π denotes the fraction of FN instances among the instances with unknown label). Using our method, we proceed as follows. As Equation 6 shows, we need a crude estimate of the proportion ˆπ of FN instances among the instances with unknown label inside the stratum to obtain the sample size 7. There are 490 pure positives and let us assume that using a small 20% sample, we discover 5 of the 10 false negative instances in the Mixed stratum. We also assume that the Negative stratum contains no false negatives. Our estimate then is 495 (instead of the true 500). To stay within a 20% error, our margin of error is 20% 495 = 99, or expressed as a proportion of the unknown instances ǫ = 99/ = Since the estimation must be within an ǫ margin of 7 This estimate offers no guarantees and can well be wrong. The purpose of the sampling is to refine this estimate until it gets within the error bounds. 17

18 error, ǫ must be divided between the 3 strata. Arbitrarily 8, we allocate error to Negative, to Mixed and 0 to Positive. Then our total sample size is 6,111 for the Negative (geometric model; Equation 3, ǫ =.00049), 490 for the Positive (at 0 error, we have to determine the true label of all of them) and 508 (Equation 6, ˆπ = 5/510, ǫ =.00049) for the Mixed stratum. This approach results in a total of 6, sampled instances, which is less than the 16,000 for SRS. Since we applied SRS to the Mixed stratum, and did not sample the Positives at all, the gain comes from drawing a small sample from the Negatives. Although in this example, this is not possible, let us assume that we find 3 false negative instances 9 in the sample we drew from the Negatives. Then we revert back to SRS. The sample size required now is 7,277 (Equation 6 with ˆπ = 3/6111, ǫ =.00049) we have to draw an additional sample of size =1166. The total sample now is , which is still considerably smaller than just using SRS. 5 Evaluation We evaluated our algorithm on the KDD Cup 99 Network Intrusion data set. The data set contains 494,021 instances. For each instance 41 features are measured, out of which 34 are continuous and 3 are binary. We ignored the remaining categorical variables. Although our method can work with arbitrary distance functions, for simplicity, we use Euclidean distance. Properties of various distance functions in clustering is a well-studied area and the suitability of distance functions for different intrusion detection tasks is out of the scope of this paper. As preprocessing, we standardized that data set. For each attribute i, the value a ij for record j was computed as aij E[ai ] SE[a i ]. Now, intuitively, each value indicates how far a measurement is from the average for that attribute in terms of standard deviations. The instances in the data set belong to 20 attack classes and one class representing the normal traffic. In the original KDD Cup task, algorithms were required to detect 4 types of attack categories: scans, DoS, user-to-root and remote-to-local. The true class labels are available with the data set, but naturally we did not make them available to the algorithm. We used them as the gold standard test to determine the true label of the instances in the sample. Among the four KDDCup tasks, we believe that scans can be detected using only the continuous and binary variables, hence we will demonstrate our method by estimating the performance of a scan detector. Out of the 20 attack classes, 4 pertain to scans. The total number of scanning instances is 4,107, which is less than 1% of the total number of instances. We created the scan detector using Ripper [46]. It detected 86% of all scans. 8 This allocation of errors is obviously not optimal, it serves only illustrative purposes. 9 With only 2 false negatives, we are still within the required error bounds, so we do not need to take further action. 18

19 Evaluating the Estimates. Let FN ˆ denote the estimate for the number of FN instances; let FN denote the actual number of FN instances (the amount we need to estimate); and let E[ FN] ˆ and Var[ FN] ˆ denote the expected value and variance of the estimate, respectively. Then we will evaluate the estimates using the following measures. Bias = E[ FN] ˆ FN Precision = Var[ FN] ˆ = E[( FN ˆ E[ FN]) ˆ 2 ] Accuracy = MSE[ FN] ˆ = E[( FN ˆ FN) 2 ] Bias shows how far the estimate is from the real value and it also shows whether the estimate is an over- or underestimate. Precision is indicative of the variability in the estimate and accuracy shows how close the estimate to the real value is in absolute (squared) terms. For both precision and accuracy, higher values indicate lower precision and accuracy. To avoid confusion, we will use the term better or worse precision (and accuracy). 5.1 Estimation via Simple Random Sampling (SRS). In this experiment, we took samples of sizes 10,000, 20,000,..., 200,000 and estimated the number of false negatives from these samples. For each sample size, the estimation was performed on 100 different samples. Figure 3 depicts the estimates. The 100 results for each sample size (along the horizontal axis) are represented by a box plot. The top and bottom of the box denote the upper and lower percentiles and the middle line corresponds to the median. The whiskers extend out to the minimal and maximal estimates Estimating the Number of FNs Using SRS FN Estimate ,000 50, , ,000 Sample Size Figure 3: FN Estimation via SRS Strictly speaking, the estimates are within 20% error at 95% probability only for a sample size of 200,000, but they already get close (within 20% error at 93% probability) at a sample size of 100,000. Table 1 contains summary statistics 19

20 (minimal value, lower quartile, median, upper quartile and maximal value) of the bias, precision and accuracy of SRS over the 19 different sample sizes. The estimates appear reasonably unbiased, but they have a large variance. Table 1: Statistics of the bias, precision and accuracy of SRS over the different sample sizes. Denoted are the minimal observed value (Min.), the lower quartile (LQ), the median, the upper quartile (UQ) and the maximal observed value (Max.). Metric Min. LQ Median UQ Max. Bias Prec Accy Estimation via the Proposed Method. Our method has two probabilistic elements: (i) the actual partitioning and (ii) sampling for estimation. Accordingly, we performed 10 trial s for each setting. In each trial, we perform the entire estimation process from the partitioning/clustering to the sampling for estimation. First, we compare the estimate obtained by our method to the estimate obtained from SRS utilizing the full sampling apparatus. For SRS, the sample size of 100,000 was selected, because that is the smallest sample size where the estimates are within the error bound with (almost) the required probability. For the proposed scheme, we selected minm SE =.05. Section will explain how this value was selected. Table 2: Comparison of the proposed scheme (minmse =.05) with SRS Method CFP SRS Esimate Bias Precision Accuracy Lower Qartile Median Upper Quartile Sample Size Lower Qartile 24, ,000 Median 25, ,000 Upper Quartile 25, ,000 20

Estimating False Negatives for Classification Problems with Cluster Structure

Estimating False Negatives for Classification Problems with Cluster Structure György J. Simon Vipin Kumar Zhi-Li Zhang Department of Computer Science, University of Minnesota {gsimon,kumar,zhzhang}@cs.umn.edu