Estimating False Negatives for Classification Problems with Cluster Structure


György J. Simon    Vipin Kumar    Zhi-Li Zhang
Department of Computer Science, University of Minnesota

Abstract

Estimating the number of false negatives for a classifier when the true outcome of the classification is ascertained only for a limited number of instances is an important problem, with a wide range of applications from epidemiology to computer/network security. The frequently applied method is random sampling. However, when the target (positive) class of the classification is rare, which is often the case with network intrusions and diseases, this simple method results in excessive sampling. In this paper, we propose an approach that exploits the inherent cluster structure of the data to significantly reduce the amount of sampling needed while guaranteeing an estimation accuracy set forth by the user. The basic idea is to cluster the data and divide the clusters into a set of strata, such that the proportion of positive instances in each stratum is very low, very high or in between. By taking advantage of the different characteristics of the strata, more efficient estimation strategies can be applied, thereby significantly reducing the amount of required sampling. We also develop a computationally efficient clustering algorithm referred to as class-focused partitioning: it uses the (imperfect) labels predicted by the classifier as additional guidance in evaluating the purity of the clusters during the clustering process and also in forming the aforementioned strata. We provide a formal mathematical analysis of our estimation methodology and show that it can provide performance far superior to random sampling when there is inherent cluster structure, and that it deteriorates into simple random sampling when the data has no relevant cluster structure. We evaluated our method on the KDD Cup network intrusion data set. Our method achieved better precision and accuracy with a 5% sample than the best trial of simple random sampling with 40% samples.

Keywords: false negative estimation, random sampling, clustering

1 Introduction

Determining the performance of a classifier is an important and well-studied problem. While a wealth of metrics (such as precision, recall, F-measure [46]) and techniques (e.g. bootstrapping, cross-validation, ROC curves [46, 38, 19]) have been developed for evaluating the performance of a classifier under a variety of conditions, most of them require that the true class labels are known. In many classification problems, however, establishing the true class label of all instances is not feasible.

In epidemiology, very reliable classifiers, gold standard tests, exist, but they are so intrusive that applying them to the general public without sufficient indication of the presence of the disease is deemed unethical, and it is expensive as well. In the case of network intrusion detection, a security expert (say, of an enterprise network) serves as a reliable classifier, who can distinguish malicious traffic behavior (positive class) from normal traffic behavior (negative class) with high confidence. However, due to the high traffic volume, it is unrealistic to expect a security expert to inspect every single traffic flow in his/her network.

Instead of the gold standard test, a less expensive but imperfect classifier, such as an epidemiological screening test or a network intrusion detection system (IDS), is often employed. A positive outcome of this imperfect classifier is considered sufficient indication, and the true status of these predicted positive instances is investigated. The goal of this paper is to develop a method for evaluating the performance of these imperfect classifiers. Part of the performance is already known: the true class label of all predicted positive instances is already determined. To obtain a full evaluation, however, one must also determine the number of false negative instances.

A frequently used method to estimate the number of false negatives is to draw a sufficiently large sample of the instances classified as negative, determine the true class labels by applying the gold standard test and compute the proportion of positive instances in the sample [32]. Since in many applications the positive class is typically rare (e.g., network intrusions and diseased outcomes in a medical screening test), and a (hopefully) large portion of it has already been identified by the classifier, only very few positive instances are left among the (negative) majority of instances whose true class label has not yet been determined. Under these conditions, the sample size required to accurately estimate the number of false negatives grows excessively large. We will show later that detecting a certain class of common intrusions within a modest 20% margin of error at 95% or higher probability in the KDD Cup data set [1], which contains about 480,000 instances, would require a sample size of over 100,000 instances.

In this paper, we present a novel approach based on stratified sampling that allows for estimating the number of false negative instances more efficiently, while guaranteeing that the error stays within bounds specified by the user. The basic idea is to exploit the inherent cluster structure of the classification problem. We cluster and partition the data into distinct sets ("strata") of clusters. Specifically, four strata are formed: (1) a Negative stratum that is hypothesized to contain clusters with a proportion of false negative (FN) instances close to (or equal to) 0; (2) a Positive stratum with clusters that are hypothesized to have an FN proportion close to (or equal to) 1; (3) a Mixed stratum with clusters that are neither purely positive nor purely negative; and (4) a Tiny stratum containing all clusters that, whether they appear pure or not, are too small to yield statistically significant results alone. These strata have different statistical properties which can be exploited to develop a more efficient sampling scheme.

The efficiency of our approach stems from two factors.

First, in the case of simple random sampling (SRS), purer clusters have smaller variability in the proportion of FN instances across possible samples, which translates into smaller sample sizes. Second, when the Positive and Negative clusters are indeed close to pure, a statistically more powerful scheme, the geometric model, can be employed to further reduce the sample size. We provide a formal mathematical analysis of our methodology and show that it provides far superior performance over SRS when the data exhibits inherent cluster structure that can be exploited. In the worst case, that is, when there is no inherent cluster structure in the data or when the positive class is not rare, our scheme deteriorates into SRS.

As will be shown later, although our proposed estimation scheme works with K-means clustering, K-means has a tendency to discover clusters of similar sizes, while our problem requires that small pure clusters (which we call partitions) be discovered. To address this problem, we propose a class-focused partitioning scheme that (i) is more efficient than K-means in creating many small partitions and (ii) has the ability to incorporate the labels provided by the imperfect classifier into the clustering process.

We evaluate our proposed method on a real-world data set, the KDD Cup intrusion detection data set. Our proposed method attains a 4-fold improvement in precision and accuracy over SRS with a sample size of 200,000 instances. Our method required a sample of only 25,000 instances.

Our Contributions. The contributions of this paper are briefly summarized below. We propose a novel and efficient approach for estimating the number of false negative instances based on stratified sampling; it exploits the inherent cluster structure underlying the data set to significantly reduce the number of samples needed while providing guarantees about the estimation error. We also develop a class-focused partitioning method that facilitates the reliable extraction of the clusters (or subclusters). The class-focused partitioning method makes use of both the labels predicted by the imperfect classifier and the true labels of the predicted positive instances. This information is pushed into the partitioning algorithm in a manner such that possibly incorrect labels do not mislead the algorithm.

2 Related Works

Methods of estimation. In epidemiology, the evaluation of new screening tests is a very important problem [38]. This explains why the majority of the related work comes from this domain. Accordingly, in what follows, we will use the terms (medical) test and classifier interchangeably. In the medical domain, the goal is to establish the performance of a test based on the performance of another, imperfect test.

An early approach to assessing the performance of a new screening test with the use of existing tests was to view each test as an expert and apply non-parametric methods to assess expert agreement. A number of expert agreement measures have been applied, most notably the Kappa index [26, 51] and the McNemar test [43]. These methods, however, establish only the relative performance of a test; they do not assess the absolute performance.

The first work to estimate the number of false negatives (and hence establish the absolute performance of a test) is the method by Goldberg and Wittes [21]. They apply the capture-recapture model [27, 32, 11]. The capture-recapture model in its original form makes the assumption that the classifiers to be evaluated are independent. This assumption is not rooted in any observation; it is a mere mathematical convenience [14]. The violation of this assumption affects the result dramatically [47]. Mane et al. showed that even moderate deviations from the independence assumption can invalidate the estimate [34].

The problem of conditional dependence has been battled on multiple fronts. Mane et al. proposed an algorithm that selects sets of attributes in a manner that ensures independence between the classifiers [35, 34]. When the classifier is designed by a third party, however, we do not always have the opportunity to select the attributes to be used. Others have extended the capture-recapture method to use relaxed assumptions, such as uniform dependence between the classifiers across the classes or uniform relative risk [11].

Latent class analysis is a popular method in epidemiological and sociological studies [42, 22]. Latent class analysis is the statistical analysis of unobservable variables; in this particular application, the unobservable variable is the true disease status. In latent class analysis for two tests, Hui et al. [28, 50] simultaneously estimate the type I and type II error rates of both tests as well as the prevalence of the disease, 5 variables in total. From the application of two tests and verifying the predicted positives using the gold standard test, we gain 5 degrees of freedom. The problem is hence exactly identifiable [51], or conversely, we have no degrees of freedom left to estimate the correlation between the tests. For this reason, estimating the false negatives for two dependent classifiers is an under-defined problem [51]. The problem becomes tractable for three or more tests [51].

The latent class analysis approach has been extended to handle correlation between the classifiers. The dependence between classifiers has been modeled by additional latent classes [40], random effects [25, 41] or by using latent class loglinear modeling and assuming independence between pairs of tests [23, 53]. All these models require more than 2 tests. Latent class analysis with the true disease status as the unobservable variable, especially when the number of classes has been increased to account for dependence among the classifiers, suffers from difficult interpretation. This shortcoming has been addressed by introducing loglinear models. Loglinear (or logistic) models for the two-test problem are also under-defined if the interaction between the classifiers is to be taken into account. However, they can be used to estimate relative performance [39, 13].

Similarly to the latent class modeling approach, in order to estimate the absolute performance, additional information needs to be included. Alonzo et al. looked at two-phase studies, where the subjects are screened multiple times [3]. Lloyd et al. performed the same tests in succession multiple times and counted the number of positive results; these results formed the basis for their model [30, 31].

Clustering, Semi-supervised clustering. Clustering is the problem of grouping objects into clusters such that objects within a cluster are more similar to each other than to objects in other clusters [46]. Arguably one of the most popular clustering methods is k-means [33]. K-means partitions the set of objects into k disjoint subsets under the constraint that the sum of squared errors is to be minimized. K-means is simple and fast (it has a complexity of O(kn), with n being the number of objects and k denoting the desired number of clusters), but suffers from a number of drawbacks. It tends to discover spherical clusters of similar sizes and it has no mechanism to handle outliers. Also, k-means requires that the user specify the number of clusters a priori. Furthermore, the clustering results are very susceptible to the initialization of the algorithm. Bisecting k-means [45] is an improvement over k-means, providing a lesser degree of dependence on the initialization. Bisecting k-means initially considers the entire data set as one cluster and iteratively bisects the least homogeneous cluster; in the end, it produces a hierarchy of clusters.

DBSCAN [20], a density-based clustering algorithm, takes a different approach. Essentially, it finds regions with a high density of objects. Some objects, called outliers, will not belong to any cluster. While DBSCAN can find arbitrarily shaped clusters and detect outliers, it has a higher computational cost. While spatial indices can reduce the complexity to O(n log n), in high dimensions spatial indices fail and the complexity becomes closer to O(n^2), where n denotes the number of objects to cluster. As far as user-specified parameters are concerned, DBSCAN automatically discovers the number of clusters, but the user still needs to specify the minimal density that qualifies as a cluster. Since the minimal cluster density needs to be specified globally, clustering using local density (i.e. finding clusters whose density is higher than the density around them) is not possible. OPTICS [4] is an extension of DBSCAN for visualizing clusters based on density. Using OPTICS, a human user can detect varying cluster densities, but it is still insufficient for automatic density-based partitioning-type cluster detection.

For our problem, where a large number of unlabeled instances and only a few labeled instances exist, semi-supervised learning [54, 44, 24] appears applicable. In particular, semi-supervised clustering [24] exhibits much resemblance. Semi-supervised clustering approaches make use of labeled instances in multiple ways. Labels (or user-defined constraints in general) can be used to automatically learn a suitable distance measure [29, 15, 52] or objective function [18, 16], or they can be used to guide the assignment of objects to clusters [48].

These techniques are not mutually exclusive; some recent methods provide a unified solution [9, 10]. Another way to use the labels is to perform dimensionality reduction before clustering [5]. Although many clustering algorithms (complete link [29], other agglomerative hierarchical techniques [17], COBWEB [49]) have been extended to make use of labels, k-means (and its generic version, EM [9]) is the most popular method ([8, 48, 7, 37] and many others). K-means, with its dependence on the initial centroids, gives rise to a novel use of the labels: one can find better centroids using labeled data [6].

Semi-supervised clustering methods are in general applied when the problem has a clear structure, but this structure may be difficult to discover without some assistance from the user. In our case, we do not need to discover the exact structure of the data. We use the labels to verify the correctness of the classification algorithm that generated them. In fact, if we directly incorporated the labels into the clustering process in the form of adjustments to the distance function or to the assignment of instances to clusters, our argument would become circular. Moreover, the labels (or constraints) in semi-supervised clustering are assumed to be correct; in our case, the labels are imperfect.

3 Problem Formulation

We consider a classification problem where the instances of a population D are classified as having either a positive (+) or a negative (−) class label. Our goal is to evaluate the performance of a classifier T when the true labels are not known for all instances. An instance is predicted positive if T predicts it to have a positive label; otherwise it is predicted negative. We have an infallible expert determine the true label of the predicted positive instances. A predicted positive instance is a true positive (TP) if its true label is positive, otherwise it is a false positive (FP). Analogously, a predicted negative instance is a false negative (FN) (resp., true negative) if its true label is positive (resp., negative).

Goal. Derive an estimate ˆt for the number t of false negative instances, given the predicted labels and the true labels of the predicted positive instances, such that

    Pr[ |ˆt − t| / ˆt < ε ] = 1 − α,

where ε is the margin of error and α is the significance level. Both ε and α are specified by the user. In the experiments, we will use α = 5% and ε = 20%.

One way to estimate t is to use simple random sampling (SRS). In SRS, we take a number of samples from the predicted negative instances, have the infallible expert determine the true labels of the instances in the samples and compute the proportion of false negative (FN) instances in the samples. From this information, we can infer the number of FN instances in the population of predicted negative instances.
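To make the SRS baseline concrete, here is a minimal Python sketch of this estimator. The helper `gold_standard` is a hypothetical stand-in for the infallible expert: it returns 1 if an instance is truly positive (i.e., a false negative among the predicted negatives) and 0 otherwise.

```python
import random

def srs_fn_estimate(predicted_negatives, sample_size, gold_standard):
    """Estimate the number of FN instances among the predicted negatives
    via simple random sampling (SRS)."""
    sample = random.sample(predicted_negatives, sample_size)
    fn_in_sample = sum(gold_standard(i) for i in sample)
    pi_hat = fn_in_sample / sample_size        # estimated FN proportion
    t_hat = pi_hat * len(predicted_negatives)  # scaled up to the population
    return t_hat
```

The difficulty discussed next is that, for a rare positive class, `sample_size` must become very large before `pi_hat` is stable enough to meet the error bound.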

When the positive class is rare (i.e., the number of instances with a positive true label is very small relative to the total size of the data set), the required sample size becomes excessively large. Let us consider the concrete example of the KDD Cup intrusion detection data set. The positive class accounts for only 1% of all instances. While 85.99% of the positive instances can be identified by our classifier, the remaining 575 positive instances are hidden among the approximately 490,000 predicted negative instances. Using SRS, obtaining a good estimate of these 575 false negative instances within a 20% margin of error at strictly 95% or higher probability requires a sample size of 200,000, which is about 40% of the total population¹!

¹ Obtained from simulation. If we knew the true proportion of FN instances, theoretically 96,000 would be enough.

To develop a more efficient method for estimating the false negative instances that does not require excessive sampling while attaining good estimation accuracy, in this paper we propose an innovative approach that exploits the inherent cluster structure underlying the classification problem: it partitions the data and divides the partitions into four distinct strata, each featuring certain statistical properties that make more efficient sampling strategies applicable. In the following section, we show how such a partitioning can be attained, and how it leads to an efficient estimation method that significantly reduces the amount of sampling while attaining the user-specified accuracy bound.

4 Proposed Approach

As discussed above, our goal is to divide the data into groups that are likely to be pure: in each group, either all or none of the predicted negative instances are positive. Intuitively, this can be attained because the population adheres to a limited number of modes of behavior. Example modes of behavior in the network data are web and SMTP (e-mail) traffic in a data communications network. Instances that share the same mode of behavior are assumed to be more similar to each other than instances with different modes of behavior.

Our method aims to discover the above modes of behavior and to exploit them to the benefit of the estimation process. To this end, we employ a class-focused partitioning technique that discovers partitions of the population in which estimating the number of false negative instances is easy. Estimation is easy when the proportion of false negative instances in a partition is close to 0 or close to 1. In other words, partitioning can be thought of as focusing the positive instances into one set of partitions and the negative instances into a different set of partitions.

Unfortunately, purity, which is measured by the true proportion of false negative instances in a partition, cannot be directly observed. Instead, we can assume that a partition is pure if all of the following conditions hold:

(a) the partition is tight, that is, it has a high intra-cluster similarity; (b) it is observed pure, namely the imperfect labels indicate that all instances in the partition are either true positives or false positives (i.e. the partition does not contain both true and false positives); and (c) it is nearest-neighbor (NN) pure: its nearest neighbor is also observed pure and the two partitions share the same labels.

[Figure: illustration of the motivation for NN purity.] The positive instances in the two negative partitions in the figure would be difficult to detect, but their presence can be suspected from the vicinity of the large positive partition.

Once these tight, pure partitions are discovered, the number of false negative instances can be determined by our modified stratified sampling scheme. In what follows, we describe the class-focused partitioning algorithm and our estimation scheme in detail.

4.1 Class-Focused Partitioning

The class-focused partitioning technique aims to discover partitions that are pure. Purity cannot be observed, so we heuristically approximate it. We consider a partition pure if it is tight, observed-pure and large enough (i.e. not tiny). Formal definitions of these properties follow.

Property 1 (Tightness) We call a partition tight if, for its mean squared error (MSE), MSE < minMSE holds, where minMSE is a user-defined threshold.

MSE measures the average dissimilarity of the partition by comparing each instance to the partition mean. MSE does not penalize large partitions as long as their instances are sufficiently similar to each other.

Property 2 (Tininess) We call a partition tiny if it is so small that even if it contains positive instances, the classifier is unlikely to discover them.

Tininess is related to the recall of the classifier, which in turn requires that the false negative rate be known. Consequently, a precise threshold cannot be derived. Instead, we use a heuristic to determine a global threshold s (a small computational sketch follows below):

    Pr[predicted positive | actual positive] · s = (a / (θN)) · s < 1,

where a denotes the number of true positives, N the total number of instances and θ the assumed² prevalence of the positive class.

² θ does not need to be precise. Tiny partitions will be pooled together, so as long as s is small, it will have no impact.
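As a concrete reading of this heuristic, the following sketch computes the largest partition size s that still counts as tiny; the function name and the example numbers are ours, chosen for illustration.

```python
import math

def tininess_threshold(a, N, theta):
    """Largest partition size s with (a / (theta * N)) * s < 1, i.e. the
    classifier is not expected to flag even one positive in the partition.

    a     -- number of true positives found by the classifier
    N     -- total number of instances
    theta -- assumed prevalence of the positive class
    """
    recall_proxy = a / (theta * N)  # Pr[predicted positive | actual positive]
    return math.ceil(1.0 / recall_proxy) - 1

# Hypothetical numbers: a = 50 detected positives, N = 100000, theta = 0.01
# give recall_proxy = 0.05, so partitions with up to s = 19 instances
# with unknown labels count as tiny.
```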

Property 3 (Purity) We call a partition pure if most (or preferably all) instances share the same mode of behavior.

From the information we have available, we cannot measure purity. Instead, observed purity is measured using the imperfect labels. A partition is observed-pure if, for any of the classifiers, TPR > 0 and FPR = 0, or if TPR = 0 and FPR = 0³. Observed purity does not imply (unobserved) purity; however, the two are correlated.

³ The lack of true and false positive instances can be indicative of the lack of FN instances.

A partition P is called nearest-neighbor-pure (NN-pure) if P is tight, not tiny and observed-pure, and the partition whose centroid is closest to the centroid of P is tight, observed-pure, not tiny and consists of instances that share the same label as the instances in P.

The Algorithm. The class-focused partitioning technique extends bisecting k-means. Initially, it considers the entire data set D as one cluster. Then it iteratively partitions it into two smaller partitions until the stopping condition evaluates to true. Once the stopping condition is met, the partition is set aside for estimation.

Stopping Condition. The purpose of the class-focused partitioning is to create partitions where estimating the number of FNs is easier. As discussed before, estimation is simple when the partition can be assumed to be almost pure, that is, it is tight, observed-pure, NN-pure and not tiny. Bisecting stops when such a stage is achieved. Moreover, bisecting also stops when further partitioning is not beneficial. Partitioning is not beneficial when (a) the partition contains no instances with unknown labels, (b) the partition is already tiny or (c) bisecting the partition does not result in an improvement in terms of either observed purity or tightness.

For each partition, let n denote the total number of instances, u the number of instances with unknown class labels and s the threshold for tininess. Let mse denote the MSE of the partition and let minMSE be a user-defined minimum MSE threshold. The partition is not bisected if any of the following conditions holds (a sketch of these rules in code follows after the list).

1. If u = 0, then the class label of all instances is known.

2. If u < s, then the partition is tiny.

3. If mse < minMSE, then the partition is tight.

4. If bisecting the partition results in no improvement. Let m_p denote the MSE of the parent partition and m_1 and m_2 the MSEs of the child partitions. Let t_p denote the true positive rate (TPR) of the parent and let t_1 and t_2 denote the TPRs of the two child partitions. There is no improvement if

       m_p / m_1 ≈ m_p / m_2 ≈ 4   and   (t_p − t_1) / t_p ≈ (t_p − t_2) / t_p ≈ 0,

   allowing for a fluctuation of .5 in these ratios. The rationale: there is no improvement if the density of all instances as well as the density of the positive instances is uniform over the partition. In that case, by bisecting the partition, the average distance is expected to halve and the MSE to become 1/4 of the parent's; hence the factor of 4. The true positive rate is expected to remain unchanged.
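The stopping rules can be summarized in a short sketch. The partition interface (`num_unknown`, `mse`, `tpr`, `trial_bisect`) is hypothetical, and the no-improvement test encodes the factor-of-4 / 0.5-fluctuation reading of the garbled condition given above.

```python
def should_stop(p, s, min_mse):
    """Stopping condition of class-focused partitioning (rules 1-4)."""
    if p.num_unknown == 0:    # 1. all class labels are known
        return True
    if p.num_unknown < s:     # 2. the partition is tiny
        return True
    if p.mse < min_mse:       # 3. the partition is tight
        return True
    return no_improvement(p)  # 4. bisecting does not help

def no_improvement(p, fluctuation=0.5):
    """Both children's MSE is roughly a quarter of the parent's and the
    children's TPR is roughly the parent's (the uniform-density case)."""
    (m1, t1), (m2, t2) = p.trial_bisect()  # MSE and TPR of the two children
    mse_unchanged = all(abs(p.mse / (4 * m) - 1) <= fluctuation
                        for m in (m1, m2) if m > 0)
    tpr_unchanged = (all(abs((p.tpr - t) / p.tpr) <= fluctuation
                         for t in (t1, t2)) if p.tpr > 0 else True)
    return mse_unchanged and tpr_unchanged
```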

4.2 Outline of the Estimation

Once the partitions are created, we form the strata on which our stratified-sampling-based estimation is carried out. We shall show that estimation using these strata is correct (i.e. the estimate satisfies the error bound defined in Section 3) and more efficient than simply applying simple random sampling (SRS).

Conceptually, we create 3 strata: (i) Positive, which is relatively rich in FN instances, (ii) Negative, which contains relatively few FN instances, and (iii) Mixed, containing the rest of the partitions. The Negative stratum is assumed to contain no false negative instances, while the Positive stratum is assumed to contain no true negative instances. Estimation then proceeds as follows. Let ˆt denote the number of FN instances in a stratum. Since the Negative stratum is assumed to contain no false negatives, we estimate ˆt = 0. The purpose of sampling here is to prove that there are not too many⁴ false negative instances in the stratum with respect to the predefined error bounds. We employ the geometric model (described later in Section 4.4.1) to this end. If we happen to find too many false negatives during the sampling, then we revert back to SRS in this stratum. With the Positive stratum we proceed similarly. This stratum is assumed to contain only positive instances, hence ˆt = u, where u denotes the number of instances in this stratum with unknown true label. Here we employ the geometric model to prove that there are indeed not too many true negatives in this stratum. If too many true negatives are found, the number of FN instances in this stratum is estimated via SRS. Finally, for the Mixed stratum, we simply employ SRS.

⁴ I.e. the error is within the pre-defined error bound at 1 − α probability. In the rest of the paper, we will use "too many" in this sense.

Our approach is correct in the sense that the estimate will be within the error bounds. This is guaranteed for SRS. We only deviate from SRS when the stratum is observed to be pure; should we discover that it is not sufficiently pure, we revert back to SRS. The worst-case scenario is when the data set is uninformative, i.e. it does not contain relevant clusters. Some partitions could appear (by chance) pure negative, others pure positive. During the application of the geometric model, we will discover (at 1 − α probability, as will become clear in Section 4.4.1) that these partitions are not pure and return to SRS.

The effectiveness of our approach stems from two factors. First, when the partitions are observed pure, the geometric model is applicable⁵. Second, even when the partitions are found to be insufficiently pure, the sample size required to obtain estimates using SRS within a fixed error bound decreases with increasing purity (see Equation 5). This is demonstrated in the following examples.

⁵ SRS is not even applicable when the proportion is 0 or 1: the variance of the estimate becomes 0 and, alongside, the sample size becomes 0, too.

Example 1 (FN estimation with relative error in the population). Let us consider a population of 100,000 instances with unknown labels. The goal is to estimate the proportion π of FN instances within 20% error at 95% probability. Figure 1 depicts the required sample size as a function of π. The results are obtained from simulation: for each π ∈ {1/1000, ..., 999/1000}, we tested sample sizes of s ∈ {1%, 2%, ..., 99%}. For each π and for each s, we ran 100 trials, and for each π we determined the smallest s such that at least 95 of the 100 trials resulted in an estimate ˆπ of π within 20% error. Figure 1 depicts this minimal s for each π.

[Figure 1: The minimal sample size required to estimate a proportion within a certain (relative) bound of error. Relative error of 20%, population of 100,000. Axes: true FN proportion in the population vs. required sample size (percent).]

Figure 1 shows that the sample size required to perform estimation within a certain amount of relative error, i.e. when the error is a function of the estimate such as 20% of ˆπ, increases as the proportion of FN instances decreases.

Example 2 (FN estimation with absolute error in the stratum). Let us assume that the proportion π of FN instances in the population is fixed, and let ˆπ be a fixed estimate of π. Let the error ε be 20% of ˆπ; ε is also fixed. Next, we create 3 strata and distribute the FN instances among these 3 strata arbitrarily. We will concentrate on one of the 3 strata, which we denote by N. In this example, we control the proportion π_N of FN instances in N: π_N = 1/1000, 2/1000, ..., 999/1000. Let us remind the reader that the total number of FN instances is fixed; we just change their distribution over the 3 strata. We also need to distribute the error ε among the 3 strata. Arbitrarily, let us assign ε_N = .3ε. Let us remind the reader that ε_N is an absolute error, in the sense that it does not depend on π_N; it only depends on the fixed ˆπ. Figure 2 then depicts the sample size required to estimate π_N within ε_N error at 95% probability. These sample sizes were determined by simulation identically to Example 1.

[Figure 2: The minimal sample size required to obtain an estimate of a proportion within a bound of absolute error. Absolute error of 30, population of 100,000. Axes: FN proportion in the stratum vs. sample size (percent).]

Figure 2 shows that the sample size decreases with the increasing purity (i.e. increasing |0.5 − π_N|) of the partition. This observation is contrary to the observation in Figure 1. The reason lies in the use of relative error, where the error depends on the quantity to be estimated, versus the use of absolute error, where the error does not depend on the quantity to be estimated. In other words, our approach becomes increasingly efficient as the strata become purer.

When the proportions of FN instances in all three strata are equal and coincide with the proportion of FN instances in the population (i.e. the clustering is totally meaningless), we perform equivalently to SRS. Our gain increases as the Negative stratum becomes purer, that is, as the difference between the proportion of FN instances in the Negative stratum and the proportion of FN instances in the entire population increases. Our gain also increases as the Positive stratum becomes increasingly pure. Accordingly, our goal is to increase the size of the Negative stratum whilst minimizing the proportion of FN instances in it, and to increase the size of the Positive stratum whilst maximizing the proportion of FN instances in it. This decreases the size of the Mixed stratum, where no gain is registered.

Estimation process.

1. The strata are formed as described in Section 4.3.

2. An initial estimate of the FN proportion ˆπ(0) is computed from the number of true positive instances and a small sample drawn from the Mixed stratum.

3. A sampling plan is created. Given the total allowed error, the sampling plan determines how much of the total allowed error is allocated to each stratum such that the total required sample size is minimal.

4. The sampling plan is executed. The outcome of the execution is an updated (more precise) FN proportion estimate (ˆπ(1)) and its achieved error.

If the achieved error is less than the maximal allowed error, then the estimation is complete and the final estimate is ˆπ(1). Otherwise, the estimate needs to be refined: a new total allowed error is computed from ˆπ(1) and the process repeats from Step 3.

In the subsequent sections, we elaborate on these steps.

4.3 Strata Formation

In this section, we focus on Step 1 of the Estimation Process. The input to the strata formation step is the set of partitions or clusters. First, all partitions/clusters that have no unknown instances are discarded; they cannot contain false negative instances. Next, four strata are formed. Conceptually, they are the same as above, but they lead to increased efficiency.

Tiny:     all tiny partitions.
Positive: all tight, not tiny, observed-pure and NN-pure partitions with TPR > 0 and FPR = 0.
Negative: all tight, not tiny, observed-pure and NN-pure partitions with TPR = 0.
Mixed:    all remaining instances.

In the case of the Negative category, while there is strong indication that the partition is pure when TPR = 0 and FPR = 0, there is no indication that the partition is indeed negative. It is possible that the partition is purely positive, but our classifier failed to discover all of the positive instances in it. We take a small sample to ensure that the partition is not positive in its entirety; if it is, then it is moved to the Mixed stratum.
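Putting these rules together, stratum assignment can be sketched as below; the partition attributes (`num_unknown`, `tight`, `observed_pure`, `nn_pure`, `tpr`, `fpr`) are hypothetical names for the properties defined in Section 4.1.

```python
def assign_stratum(p, s):
    """Assign partition p to one of the four strata of Section 4.3.
    Partitions without unknown-label instances are discarded earlier."""
    if p.num_unknown < s:  # below the tininess threshold
        return "Tiny"
    qualified = p.tight and p.observed_pure and p.nn_pure
    if qualified and p.tpr > 0 and p.fpr == 0:
        return "Positive"
    if qualified and p.tpr == 0:
        return "Negative"  # later verified by a small sample (see text)
    return "Mixed"
```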

4.4 Estimating the Number of FN Instances

Once the strata are formed, stratified sampling is carried out to determine the number of false negative instances. Let T, P, N and M denote the Tiny, Positive, Negative and Mixed strata, respectively. For each stratum h, h ∈ {T, P, N, M}, let N_h denote the number of instances with unknown label, ˆt_h the estimate of the number of FN instances, ˆπ_h the estimated proportion of false negatives among the N_h instances with unknown label (ˆt_h = N_h ˆπ_h), e_h the estimation error in terms of number of instances and ε_h the estimation error in terms of proportion of instances with unknown label (i.e. e_h = N_h ε_h). The final estimate ˆt for the number of FN instances and its error e can be computed as

    ˆt = Σ_{h ∈ {T,P,N,M}} ˆt_h,    e = Σ_{h ∈ {T,P,N,M}} e_h.    (1)

First, in Step 2 of the Estimation Process, an initial crude estimate ˆπ(0) of the proportion of FN instances is computed from the number of pure positive instances and by taking a small sample from the Tiny and Mixed strata (which results in ˆπ(0)_M and ˆπ(0)_T, respectively; ˆπ_N is assumed to be 0):

    ˆπ(0) = N_P / Σ_{h ∈ {T,P,N,M}} N_h + ˆπ(0)_M + ˆπ(0)_T.

Next, the total allowed error is computed: ε_0 = ε ˆπ(0), where ε is defined by the user as described in Section 3.

In Step 3, a sampling plan is created. The sampling plan is an optimal allocation of the error to the four strata. Formally, given ε_0, we need to determine ε_T, ε_P, ε_M and ε_N such that

    min_{ε_P, ε_M, ε_N, ε_T} Σ_h ˆn_h    s.t.    ε_0 = Σ_h ε_h,

where ˆn_h denotes the estimated size of the required sample from stratum h. The allocation algorithm will be explained in Section 4.4.3.

In Step 4, the sampling plan is executed. This entails determining the sample size required to estimate ˆt_h within the margin of error ε_h, drawing the sample and determining the true labels. Then ˆt_h is computed. For strata T and M, we use stratified (simple) random sampling (SSRS), as we will describe in Section 4.4.2. For stratum P, ˆt_P = N_P and a sample is drawn only to verify that the stratum is indeed purely positive. Similarly, for N, ˆt_N = 0 and a sample is drawn to ascertain that the stratum is indeed purely negative. To this end, the geometric model is employed.

4.4.1 Geometric Model for P and N

Let us consider the N stratum; estimation in P works analogously. As discussed before, in stratum N all instances are assumed negative. The goal here is to probabilistically prove that there are no false negative instances in the stratum. More precisely, we shall show that if we draw a sample of size z (as determined by Equation 3) and we find 0 FN instances in that sample, then the estimate ˆt_N = 0 is correct within an error bound of ε_N with probability at least 1 − α.

N can contain 2 types of instances with unknown labels: (i) predicted negatives that are true negatives and (ii) false negatives that are actually positives. Let us assume that there are x false negative instances and N_N − x true negative instances. We need to show that if x ≥ N_N ε_N, then a sample of size z contains 0 false negative instances with probability at most α:

    C(N_N − x, z) / C(N_N, z) ≤ C(N_N − N_N ε_N, z) / C(N_N, z) ≤ (1 − ε_N)^z < α,    (2)

where C(n, k) denotes the binomial coefficient "n choose k".

This result is related to the geometric distribution. Let Z be a random variable denoting the number of trials before we successfully find a FN instance under the above conditions⁶. Then Z follows a geometric distribution with p.d.f. Pr[Z = z; ε_N] = (1 − ε_N)^z. We need Pr[Z = z; ε_N] < α, which results in

    z ≥ log α / log(1 − ε_N).    (3)

An analogous estimation can be applied to the P stratum, except that we need to find true negatives instead of false negatives.

⁶ It is reasonable to assume that the probabilities in the stratum are equal (namely 0 or very small) and that the instances are independent.

Adjusting the Estimate. Assume that for the N stratum, x (x > 0) FN instances were discovered during the sampling. In such a case, we revert back to estimation using SSRS (Section 4.4.2), using ˆπ_N = x/z as the proportion of FN instances. True negatives in P can be processed analogously. This is still advantageous, because the sample size required by SSRS decreases with the greater purity of the N and P partitions.
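In code, the required sample size of Equation 3 is a one-liner; a quick sketch follows, with a hedged numeric check against the worked example of Section 4.5 (small differences can come from rounding).

```python
import math

def geometric_sample_size(eps, alpha):
    """Sample size z from Equation 3: if the stratum actually contains
    an FN proportion of at least eps, a sample of size z with zero FN
    instances occurs with probability below alpha."""
    return math.ceil(math.log(alpha) / math.log(1.0 - eps))

# geometric_sample_size(0.00049, 0.05) -> about 6,113, close to the
# 6,111 quoted in the worked example of Section 4.5.
```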

4.4.2 Stratified (Simple) Random Sampling for T and M

Let y_hi denote the true label of instance i in stratum h, h ∈ {T, M}:

    y_hi = 1 if instance i is a false negative, and 0 otherwise.

The population mean corresponds to the proportion of false negatives in stratum h:

    ȳ_h = π_h = Σ_{i=1}^{N_h} y_hi / N_h,

where N_h is the number of instances in the stratum. When we take a sample of n_h instances, we can estimate the population mean by the sample mean

    ˆȳ_h = ˆπ_h = Σ_{i=1}^{n_h} y_hi / n_h.

The estimate for the number of false negative instances in the stratum is ˆt_h = N_h ˆπ_h. The population variance S² in stratum h is estimated by the sample variance

    s²_h = 1/(n_h − 1) Σ_{i=1}^{n_h} (y_hi − ˆȳ_h)² = n_h/(n_h − 1) ˆπ_h (1 − ˆπ_h).

An unbiased estimator of the variance of ˆȳ_h is

    Var[ˆȳ_h] = Var[ˆπ_h] = (1 − n_h/N_h) s²_h / n_h = (1 − n_h/N_h) ˆπ_h (1 − ˆπ_h) / (n_h − 1).    (4)

The variance of ˆȳ_h is related to the estimation error through the confidence intervals. The estimation error as a proportion is the half-length of the 1 − α confidence interval:

    ε_h = z_{α/2} sqrt(Var[ˆȳ_h]).    (5)

One can also express the estimation error in terms of the number of instances: e_h = ε_h N_h.

Estimation of the required sample size. Next, let us derive the sample size required for estimating the number (or proportion) of false negative instances within an error proportion of ε_0 at probability 1 − α:

    Pr[ |ˆȳ − ȳ| / ȳ < ε_0 ] = 1 − α.

We know that the error (as a proportion) is

    ε_h = z_{α/2} sqrt(1 − n_h/N_h) · S / sqrt(n).

Solving for n yields

    n_h = z²_{α/2} S² / ( (ε_0 ȳ)² + z²_{α/2} S² / N_h ).

An estimate of the required sample size to be taken from stratum h is then

    ˆn_h = z²_{α/2} s²_h / ( (ε_0 ˆȳ_h)² + z²_{α/2} s²_h / N_h ) ≈ z²_{α/2} ˆπ_h (1 − ˆπ_h) / ( (ε_0 ˆπ_h)² + z²_{α/2} ˆπ_h (1 − ˆπ_h) / N_h ).    (6)
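A sketch of Equation 6 in code, using scipy for the normal quantile; the `eps_abs` argument is the absolute error allowed on the proportion (ε_0·ˆπ_h for a relative bound, or ε_h directly).

```python
from scipy.stats import norm

def ssrs_sample_size(pi_hat, N_h, eps_abs, alpha=0.05):
    """Required SSRS sample size for stratum h (Equation 6).

    pi_hat  -- estimated FN proportion in the stratum
    N_h     -- number of unknown-label instances in the stratum
    eps_abs -- allowed absolute error on the proportion
    alpha   -- significance level
    """
    z = norm.ppf(1.0 - alpha / 2.0)  # z_{alpha/2}, e.g. 1.96 for alpha = .05
    var = pi_hat * (1.0 - pi_hat)
    return (z * z * var) / (eps_abs ** 2 + z * z * var / N_h)

# e.g. the Mixed stratum of the Section 4.5 example:
# ssrs_sample_size(5/510, 510, 0.00049) -> about 508
```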

4.4.3 Optimal Allocation of Instances for Sampling

In this section, we consider the question of how many instances should be sampled in each stratum to achieve the required margin of error with the least number of instances sampled. This problem is equivalent to distributing the allowed error between the strata. Let ε_h denote the error allocated to stratum h and let ˆn_h denote the estimated sample size for stratum h. The total number of instances to be sampled is given by

    ˆn_+ = ˆn_T + ˆn_M (both via Eq. 6) + ˆn_P + ˆn_N (both via Eq. 3)
        = z²_{α/2} ˆπ_T (1 − ˆπ_T) / ( ε_T² + z²_{α/2} ˆπ_T (1 − ˆπ_T) / N_T )
        + z²_{α/2} ˆπ_M (1 − ˆπ_M) / ( ε_M² + z²_{α/2} ˆπ_M (1 − ˆπ_M) / N_M )
        + log α / log(1 − ε_P) + log α / log(1 − ε_N).

The optimal allocation is defined by

    argmin_{ε_T, ε_M, ε_N, ε_P} ˆn_+    s.t.    Σ_{h ∈ {T,P,N,M}} ε_h = ε_0.    (7)

We do not have a closed-form solution to Eq. 7; we solve it numerically. ˆπ_T and ˆπ_M are determined from an initial small sample; ˆπ_P is assumed to be 1 and ˆπ_N is assumed to be 0, as a small sample would not be useful to estimate ˆπ_P and ˆπ_N. Given the partitioning and the strata, this allocation is optimal if π_h = ˆπ_h for all h. Since the estimated error of the proportion ˆπ_h is known before the sample allocation is performed (Eq. 5), a more precise ˆπ_h can be computed from a larger sample, should the error of ˆπ_h be too large.
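The paper only states that Eq. 7 is solved numerically; as one illustration, a brute-force grid search over the error allocation might look as follows (all function and variable names are ours, not the authors').

```python
import math
import itertools

Z = 1.96  # z_{alpha/2} for alpha = 0.05

def total_sample_size(eps, pi, N, alpha=0.05):
    """n_hat_+ from the display above: SSRS cost for T and M (Eq. 6),
    geometric cost for P and N (Eq. 3)."""
    def ssrs(p, n, e):
        v = p * (1.0 - p)
        return (Z * Z * v) / (e * e + Z * Z * v / n)
    def geo(e):
        return math.log(alpha) / math.log(1.0 - e)
    return (ssrs(pi["T"], N["T"], eps["T"]) + ssrs(pi["M"], N["M"], eps["M"])
            + geo(eps["P"]) + geo(eps["N"]))

def allocate_error(eps0, pi, N, steps=20):
    """Grid search over allocations (eps_T, eps_M, eps_P, eps_N) summing
    to eps0; returns the allocation with the smallest total sample size."""
    ticks = [eps0 * i / steps for i in range(1, steps)]
    best_cost, best_eps = float("inf"), None
    for eT, eM, eP in itertools.product(ticks, repeat=3):
        eN = eps0 - eT - eM - eP
        if eN <= 0.0:
            continue
        eps = {"T": eT, "M": eM, "P": eP, "N": eN}
        cost = total_sample_size(eps, pi, N)
        if cost < best_cost:
            best_cost, best_eps = cost, eps
    return best_eps, best_cost
```

A finer grid or an off-the-shelf constrained optimizer would refine the allocation; the grid search merely makes the structure of the problem explicit.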

4.5 Example

Let us illustrate the estimation approach through an example. For simplicity, we use 3 strata. In our hypothetical problem, let us assume that there are 101,000 instances with unknown labels, out of which 500 are false negatives. 100,000 instances belong to the Negative stratum, 490 to the Positive and 510 to the Mixed. Out of the 510 Mixed instances, 10 are false negatives.

SRS, even if it knew the true proportion of FN instances before the estimation (which is a very strong assumption), would still require a sample of 16,207 instances (Equation 6 with ˆπ = 500/101,000, ε = 100/101,000) to arrive at a correct estimate ˆd, s.t. |ˆd − 500| < 100 at 95% probability (π denotes the fraction of FN instances among the instances with unknown label).

Using our method, we proceed as follows. As Equation 6 shows, we need a crude estimate ˆπ of the proportion of FN instances among the instances with unknown label inside the stratum to obtain the sample size⁷. There are 490 pure positives, and let us assume that, using a small 20% sample, we discover 5 of the 10 false negative instances in the Mixed stratum. We also assume that the Negative stratum contains no false negatives. Our estimate then is 495 (instead of the true 500). To stay within a 20% error, our margin of error is 20% of 495 = 99, or, expressed as a proportion of the unknown instances, ε = 99/101,000 ≈ .00098. Since the estimation must stay within an ε margin of error, ε must be divided between the 3 strata. Arbitrarily⁸, we allocate .00049 error to Negative, .00049 to Mixed and 0 to Positive. Then our total sample size is 6,111 for the Negative stratum (geometric model; Equation 3, ε = .00049), 490 for the Positive (at 0 error, we have to determine the true label of all of them) and 508 for the Mixed stratum (Equation 6, ˆπ = 5/510, ε = .00049). This approach results in a total of 6,111 + 490 + 508 = 7,109 sampled instances, which is much less than the 16,000 for SRS. Since we applied SRS to the Mixed stratum and had to verify all of the Positives, the gain comes from drawing only a small sample from the Negatives.

⁷ This estimate offers no guarantees and can well be wrong. The purpose of the sampling is to refine this estimate until it gets within the error bounds.

⁸ This allocation of errors is obviously not optimal; it serves only illustrative purposes.

Although in this example this is not possible, let us assume that we find 3 false negative instances⁹ in the sample we drew from the Negatives. Then we revert back to SRS. The sample size required now is 7,277 (Equation 6 with ˆπ = 3/6,111, ε = .00049), so we have to draw an additional sample of size 7,277 − 6,111 = 1,166. The total sample is now 7,109 + 1,166 = 8,275, which is still considerably smaller than with just using SRS.

⁹ With only 2 false negatives, we would still be within the required error bounds, so we would not need to take further action.
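The arithmetic of this example can be re-traced with the formulas above; the snippet below reproduces the quoted sample sizes up to small rounding differences.

```python
import math

z, alpha = 1.96, 0.05

# Negative stratum: geometric model (Equation 3), eps = .00049
n_neg = math.log(alpha) / math.log(1 - 0.00049)        # ~6,112 (paper: 6,111)

# Mixed stratum: SSRS (Equation 6), pi = 5/510, N = 510, eps = .00049
pi, N, eps = 5 / 510, 510, 0.00049
var = pi * (1 - pi)
n_mix = (z * z * var) / (eps * eps + z * z * var / N)  # ~508

# Positive stratum: zero allocated error, so all 490 labels are verified
total = n_neg + 490 + n_mix                            # ~7,109 vs. ~16,000 for SRS
```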

5 Evaluation

We evaluated our algorithm on the KDD Cup 99 Network Intrusion data set. The data set contains 494,021 instances. For each instance, 41 features are measured, out of which 34 are continuous and 3 are binary; we ignored the remaining categorical variables. Although our method can work with arbitrary distance functions, for simplicity we use the Euclidean distance. The properties of various distance functions in clustering constitute a well-studied area, and the suitability of distance functions for different intrusion detection tasks is out of the scope of this paper. As preprocessing, we standardized the data set: for each attribute i, the value a_ij of record j was transformed into (a_ij − E[a_i]) / SE[a_i]. Intuitively, each value then indicates how far a measurement is from the average of that attribute in terms of standard deviations.

The instances in the data set belong to 20 attack classes and one class representing the normal traffic. In the original KDD Cup task, algorithms were required to detect 4 types of attack categories: scans, DoS, user-to-root and remote-to-local. The true class labels are available with the data set, but naturally we did not make them available to the algorithm; we used them as the gold standard test to determine the true labels of the instances in the sample. Among the four KDD Cup tasks, we believe that scans can be detected using only the continuous and binary variables, hence we demonstrate our method by estimating the performance of a scan detector. Out of the 20 attack classes, 4 pertain to scans. The total number of scanning instances is 4,107, which is less than 1% of the total number of instances. We created the scan detector using Ripper [46]. It detected 86% of all scans.

Evaluating the Estimates. Let FN denote the actual number of FN instances (the quantity we need to estimate); let ˆFN denote its estimate; and let E[ˆFN] and Var[ˆFN] denote the expected value and variance of the estimate, respectively. We evaluate the estimates using the following measures:

    Bias = E[ˆFN] − FN
    Precision = Var[ˆFN] = E[(ˆFN − E[ˆFN])²]
    Accuracy = MSE[ˆFN] = E[(ˆFN − FN)²]

Bias shows how far the estimate is from the real value, and it also shows whether the estimate is an over- or underestimate. Precision is indicative of the variability of the estimate, and accuracy shows how close the estimate is to the real value in absolute (squared) terms. For both precision and accuracy, higher values indicate lower precision and accuracy; to avoid confusion, we will use the terms "better" or "worse" precision (and accuracy).
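Computed over repeated estimation trials, these three measures amount to a few lines of numpy; a small sketch:

```python
import numpy as np

def evaluate_estimates(estimates, true_fn):
    """Bias, precision (variance) and accuracy (MSE) of FN estimates
    collected over repeated trials; for precision and accuracy, higher
    values mean worse performance."""
    est = np.asarray(estimates, dtype=float)
    bias = est.mean() - true_fn
    precision = est.var()                     # E[(FN_hat - E[FN_hat])^2]
    accuracy = np.mean((est - true_fn) ** 2)  # E[(FN_hat - FN)^2]
    return bias, precision, accuracy
```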

5.1 Estimation via Simple Random Sampling (SRS)

In this experiment, we took samples of sizes 10,000, 20,000, ..., 200,000 and estimated the number of false negatives from these samples. For each sample size, the estimation was performed on 100 different samples. Figure 3 depicts the estimates. The 100 results for each sample size (along the horizontal axis) are represented by a box plot: the top and bottom of the box denote the upper and lower quartiles, the middle line corresponds to the median, and the whiskers extend out to the minimal and maximal estimates.

[Figure 3: FN estimation via SRS. Axes: sample size (10,000 to 200,000) vs. FN estimate.]

Strictly speaking, the estimates are within 20% error at 95% probability only for a sample size of 200,000, but they already get close (within 20% error at 93% probability) at a sample size of 100,000. Table 1 contains summary statistics (minimal value, lower quartile, median, upper quartile and maximal value) of the bias, precision and accuracy of SRS over the 19 different sample sizes. The estimates appear reasonably unbiased, but they have a large variance.

Table 1: Statistics of the bias, precision and accuracy of SRS over the different sample sizes. Denoted are the minimal observed value (Min.), the lower quartile (LQ), the median, the upper quartile (UQ) and the maximal observed value (Max.). [Numeric entries not recoverable from the source.]

    Metric | Min. | LQ | Median | UQ | Max.
    Bias   |      |    |        |    |
    Prec.  |      |    |        |    |
    Accy.  |      |    |        |    |

5.2 Estimation via the Proposed Method

Our method has two probabilistic elements: (i) the actual partitioning and (ii) the sampling for estimation. Accordingly, we performed 10 trials for each setting; in each trial, we performed the entire estimation process from the partitioning/clustering to the sampling for estimation.

First, we compare the estimate obtained by our method to the estimate obtained from SRS utilizing the full sampling apparatus. For SRS, the sample size of 100,000 was selected, because it is the smallest sample size at which the estimates are within the error bound with (almost) the required probability. For the proposed scheme, we selected minMSE = .05 (how this value was selected is explained later).

Table 2: Comparison of the proposed scheme (minMSE = .05) with SRS, reporting the lower quartile, median and upper quartile of the estimate, bias, precision, accuracy and sample size over the trials. [Most numeric entries not recoverable from the source; the sample-size rows show the surviving digits.]

    Method                       | CFP    | SRS
    Sample size, lower quartile  | 24,... | 100,000
    Sample size, median          | 25,... | 100,000
    Sample size, upper quartile  | 25,... | 100,000


More information

AP Statistics Cumulative AP Exam Study Guide

AP Statistics Cumulative AP Exam Study Guide AP Statistics Cumulative AP Eam Study Guide Chapters & 3 - Graphs Statistics the science of collecting, analyzing, and drawing conclusions from data. Descriptive methods of organizing and summarizing statistics

More information

Performance Evaluation and Hypothesis Testing

Performance Evaluation and Hypothesis Testing Performance Evaluation and Hypothesis Testing 1 Motivation Evaluating the performance of learning systems is important because: Learning systems are usually designed to predict the class of future unlabeled

More information

POPULATION AND SAMPLE

POPULATION AND SAMPLE 1 POPULATION AND SAMPLE Population. A population refers to any collection of specified group of human beings or of non-human entities such as objects, educational institutions, time units, geographical

More information

Measures of center. The mean The mean of a distribution is the arithmetic average of the observations:

Measures of center. The mean The mean of a distribution is the arithmetic average of the observations: Measures of center The mean The mean of a distribution is the arithmetic average of the observations: x = x 1 + + x n n n = 1 x i n i=1 The median The median is the midpoint of a distribution: the number

More information

Last Lecture. Distinguish Populations from Samples. Knowing different Sampling Techniques. Distinguish Parameters from Statistics

Last Lecture. Distinguish Populations from Samples. Knowing different Sampling Techniques. Distinguish Parameters from Statistics Last Lecture Distinguish Populations from Samples Importance of identifying a population and well chosen sample Knowing different Sampling Techniques Distinguish Parameters from Statistics Knowing different

More information

Lecture 5: Sampling Methods

Lecture 5: Sampling Methods Lecture 5: Sampling Methods What is sampling? Is the process of selecting part of a larger group of participants with the intent of generalizing the results from the smaller group, called the sample, to

More information

Performance Evaluation

Performance Evaluation Performance Evaluation David S. Rosenberg Bloomberg ML EDU October 26, 2017 David S. Rosenberg (Bloomberg ML EDU) October 26, 2017 1 / 36 Baseline Models David S. Rosenberg (Bloomberg ML EDU) October 26,

More information

Regression Clustering

Regression Clustering Regression Clustering In regression clustering, we assume a model of the form y = f g (x, θ g ) + ɛ g for observations y and x in the g th group. Usually, of course, we assume linear models of the form

More information

Predictive analysis on Multivariate, Time Series datasets using Shapelets

Predictive analysis on Multivariate, Time Series datasets using Shapelets 1 Predictive analysis on Multivariate, Time Series datasets using Shapelets Hemal Thakkar Department of Computer Science, Stanford University hemal@stanford.edu hemal.tt@gmail.com Abstract Multivariate,

More information

Clustering Lecture 1: Basics. Jing Gao SUNY Buffalo

Clustering Lecture 1: Basics. Jing Gao SUNY Buffalo Clustering Lecture 1: Basics Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics Clustering

More information

Unsupervised machine learning

Unsupervised machine learning Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels

More information

Interpret Standard Deviation. Outlier Rule. Describe the Distribution OR Compare the Distributions. Linear Transformations SOCS. Interpret a z score

Interpret Standard Deviation. Outlier Rule. Describe the Distribution OR Compare the Distributions. Linear Transformations SOCS. Interpret a z score Interpret Standard Deviation Outlier Rule Linear Transformations Describe the Distribution OR Compare the Distributions SOCS Using Normalcdf and Invnorm (Calculator Tips) Interpret a z score What is an

More information

A first model of learning

A first model of learning A first model of learning Let s restrict our attention to binary classification our labels belong to (or ) We observe the data where each Suppose we are given an ensemble of possible hypotheses / classifiers

More information

Instance Selection. Motivation. Sample Selection (1) Sample Selection (2) Sample Selection (3) Sample Size (1)

Instance Selection. Motivation. Sample Selection (1) Sample Selection (2) Sample Selection (3) Sample Size (1) Instance Selection Andrew Kusiak 2139 Seamans Center Iowa City, Iowa 52242-1527 Motivation The need for preprocessing of data for effective data mining is important. Tel: 319-335 5934 Fax: 319-335 5669

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Model Accuracy Measures

Model Accuracy Measures Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Variables What we can measure (attributes) Hypotheses

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning CS4731 Dr. Mihail Fall 2017 Slide content based on books by Bishop and Barber. https://www.microsoft.com/en-us/research/people/cmbishop/ http://web4.cs.ucl.ac.uk/staff/d.barber/pmwiki/pmwiki.php?n=brml.homepage

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

Probability theory basics

Probability theory basics Probability theory basics Michael Franke Basics of probability theory: axiomatic definition, interpretation, joint distributions, marginalization, conditional probability & Bayes rule. Random variables:

More information

Lectures 5 & 6: Hypothesis Testing

Lectures 5 & 6: Hypothesis Testing Lectures 5 & 6: Hypothesis Testing in which you learn to apply the concept of statistical significance to OLS estimates, learn the concept of t values, how to use them in regression work and come across

More information

Pointwise Exact Bootstrap Distributions of Cost Curves

Pointwise Exact Bootstrap Distributions of Cost Curves Pointwise Exact Bootstrap Distributions of Cost Curves Charles Dugas and David Gadoury University of Montréal 25th ICML Helsinki July 2008 Dugas, Gadoury (U Montréal) Cost curves July 8, 2008 1 / 24 Outline

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Methods and Criteria for Model Selection. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Methods and Criteria for Model Selection. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Methods and Criteria for Model Selection CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Introduce classifier evaluation criteria } Introduce Bias x Variance duality } Model Assessment }

More information

Lecture 4. 1 Learning Non-Linear Classifiers. 2 The Kernel Trick. CS-621 Theory Gems September 27, 2012

Lecture 4. 1 Learning Non-Linear Classifiers. 2 The Kernel Trick. CS-621 Theory Gems September 27, 2012 CS-62 Theory Gems September 27, 22 Lecture 4 Lecturer: Aleksander Mądry Scribes: Alhussein Fawzi Learning Non-Linear Classifiers In the previous lectures, we have focused on finding linear classifiers,

More information

The No-Regret Framework for Online Learning

The No-Regret Framework for Online Learning The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,

More information

Business Analytics and Data Mining Modeling Using R Prof. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee

Business Analytics and Data Mining Modeling Using R Prof. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee Business Analytics and Data Mining Modeling Using R Prof. Gaurav Dixit Department of Management Studies Indian Institute of Technology, Roorkee Lecture - 04 Basic Statistics Part-1 (Refer Slide Time: 00:33)

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are

More information

Linear Discrimination Functions

Linear Discrimination Functions Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Sociology 6Z03 Review II

Sociology 6Z03 Review II Sociology 6Z03 Review II John Fox McMaster University Fall 2016 John Fox (McMaster University) Sociology 6Z03 Review II Fall 2016 1 / 35 Outline: Review II Probability Part I Sampling Distributions Probability

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

Clustering by Mixture Models. General background on clustering Example method: k-means Mixture model based clustering Model estimation

Clustering by Mixture Models. General background on clustering Example method: k-means Mixture model based clustering Model estimation Clustering by Mixture Models General bacground on clustering Example method: -means Mixture model based clustering Model estimation 1 Clustering A basic tool in data mining/pattern recognition: Divide

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Measuring Discriminant and Characteristic Capability for Building and Assessing Classifiers

Measuring Discriminant and Characteristic Capability for Building and Assessing Classifiers Measuring Discriminant and Characteristic Capability for Building and Assessing Classifiers Giuliano Armano, Francesca Fanni and Alessandro Giuliani Dept. of Electrical and Electronic Engineering, University

More information

Learning Objectives for Stat 225

Learning Objectives for Stat 225 Learning Objectives for Stat 225 08/20/12 Introduction to Probability: Get some general ideas about probability, and learn how to use sample space to compute the probability of a specific event. Set Theory:

More information

Anomaly (outlier) detection. Huiping Cao, Anomaly 1

Anomaly (outlier) detection. Huiping Cao, Anomaly 1 Anomaly (outlier) detection Huiping Cao, Anomaly 1 Outline General concepts What are outliers Types of outliers Causes of anomalies Challenges of outlier detection Outlier detection approaches Huiping

More information

Module 3. Function of a Random Variable and its distribution

Module 3. Function of a Random Variable and its distribution Module 3 Function of a Random Variable and its distribution 1. Function of a Random Variable Let Ω, F, be a probability space and let be random variable defined on Ω, F,. Further let h: R R be a given

More information

Measuring Social Influence Without Bias

Measuring Social Influence Without Bias Measuring Social Influence Without Bias Annie Franco Bobbie NJ Macdonald December 9, 2015 The Problem CS224W: Final Paper How well can statistical models disentangle the effects of social influence from

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

Forecasting Wind Ramps

Forecasting Wind Ramps Forecasting Wind Ramps Erin Summers and Anand Subramanian Jan 5, 20 Introduction The recent increase in the number of wind power producers has necessitated changes in the methods power system operators

More information

Machine Learning Lecture Notes

Machine Learning Lecture Notes Machine Learning Lecture Notes Predrag Radivojac January 25, 205 Basic Principles of Parameter Estimation In probabilistic modeling, we are typically presented with a set of observations and the objective

More information

Performance Evaluation

Performance Evaluation Performance Evaluation Confusion Matrix: Detected Positive Negative Actual Positive A: True Positive B: False Negative Negative C: False Positive D: True Negative Recall or Sensitivity or True Positive

More information

Machine Learning Practice Page 2 of 2 10/28/13

Machine Learning Practice Page 2 of 2 10/28/13 Machine Learning 10-701 Practice Page 2 of 2 10/28/13 1. True or False Please give an explanation for your answer, this is worth 1 pt/question. (a) (2 points) No classifier can do better than a naive Bayes

More information

Sampling Strategies to Evaluate the Performance of Unknown Predictors

Sampling Strategies to Evaluate the Performance of Unknown Predictors Sampling Strategies to Evaluate the Performance of Unknown Predictors Hamed Valizadegan Saeed Amizadeh Milos Hauskrecht Abstract The focus of this paper is on how to select a small sample of examples for

More information

Part 7: Glossary Overview

Part 7: Glossary Overview Part 7: Glossary Overview In this Part This Part covers the following topic Topic See Page 7-1-1 Introduction This section provides an alphabetical list of all the terms used in a STEPS surveillance with

More information

Active learning. Sanjoy Dasgupta. University of California, San Diego

Active learning. Sanjoy Dasgupta. University of California, San Diego Active learning Sanjoy Dasgupta University of California, San Diego Exploiting unlabeled data A lot of unlabeled data is plentiful and cheap, eg. documents off the web speech samples images and video But

More information

Least Squares Classification

Least Squares Classification Least Squares Classification Stephen Boyd EE103 Stanford University November 4, 2017 Outline Classification Least squares classification Multi-class classifiers Classification 2 Classification data fitting

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Matrix Data: Clustering: Part 2 Instructor: Yizhou Sun yzsun@ccs.neu.edu October 19, 2014 Methods to Learn Matrix Data Set Data Sequence Data Time Series Graph & Network

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

CS4445 Data Mining and Knowledge Discovery in Databases. B Term 2014 Solutions Exam 2 - December 15, 2014

CS4445 Data Mining and Knowledge Discovery in Databases. B Term 2014 Solutions Exam 2 - December 15, 2014 CS4445 Data Mining and Knowledge Discovery in Databases. B Term 2014 Solutions Exam 2 - December 15, 2014 Prof. Carolina Ruiz Department of Computer Science Worcester Polytechnic Institute NAME: Prof.

More information

Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size

Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size Berkman Sahiner, a) Heang-Ping Chan, Nicholas Petrick, Robert F. Wagner, b) and Lubomir Hadjiiski

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

Applying cluster analysis to 2011 Census local authority data

Applying cluster analysis to 2011 Census local authority data Applying cluster analysis to 2011 Census local authority data Kitty.Lymperopoulou@manchester.ac.uk SPSS User Group Conference November, 10 2017 Outline Basic ideas of cluster analysis How to choose variables

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

More information

Linear Classifiers as Pattern Detectors

Linear Classifiers as Pattern Detectors Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2014/2015 Lesson 16 8 April 2015 Contents Linear Classifiers as Pattern Detectors Notation...2 Linear

More information

CS281 Section 4: Factor Analysis and PCA

CS281 Section 4: Factor Analysis and PCA CS81 Section 4: Factor Analysis and PCA Scott Linderman At this point we have seen a variety of machine learning models, with a particular emphasis on models for supervised learning. In particular, we

More information

Lecturer: Dr. Adote Anum, Dept. of Psychology Contact Information:

Lecturer: Dr. Adote Anum, Dept. of Psychology Contact Information: Lecturer: Dr. Adote Anum, Dept. of Psychology Contact Information: aanum@ug.edu.gh College of Education School of Continuing and Distance Education 2014/2015 2016/2017 Session Overview In this Session

More information

Lecture 3: Pattern Classification

Lecture 3: Pattern Classification EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information

Probabilistic clustering

Probabilistic clustering Aprendizagem Automática Probabilistic clustering Ludwig Krippahl Probabilistic clustering Summary Fuzzy sets and clustering Fuzzy c-means Probabilistic Clustering: mixture models Expectation-Maximization,

More information

CHOOSING THE RIGHT SAMPLING TECHNIQUE FOR YOUR RESEARCH. Awanis Ku Ishak, PhD SBM

CHOOSING THE RIGHT SAMPLING TECHNIQUE FOR YOUR RESEARCH. Awanis Ku Ishak, PhD SBM CHOOSING THE RIGHT SAMPLING TECHNIQUE FOR YOUR RESEARCH Awanis Ku Ishak, PhD SBM Sampling The process of selecting a number of individuals for a study in such a way that the individuals represent the larger

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II)

Contents Lecture 4. Lecture 4 Linear Discriminant Analysis. Summary of Lecture 3 (II/II) Summary of Lecture 3 (I/II) Contents Lecture Lecture Linear Discriminant Analysis Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University Email: fredriklindsten@ituuse Summary of lecture

More information

Enabling Quality Control for Entity Resolution: A Human and Machine Cooperation Framework

Enabling Quality Control for Entity Resolution: A Human and Machine Cooperation Framework School of Computer Science, Northwestern Polytechnical University Enabling Quality Control for Entity Resolution: A Human and Machine Cooperation Framework Zhaoqiang Chen, Qun Chen, Fengfeng Fan, Yanyan

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information