SEMI-ANALYTICAL METHOD FOR ANALYZING MODELS AND MODEL SELECTION MEASURES


SEMI-ANALYTICAL METHOD FOR ANALYZING MODELS AND MODEL SELECTION MEASURES

By

AMIT DHURANDHAR

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2009

© 2009 Amit Dhurandhar

To my family, friends and professors

ACKNOWLEDGMENTS

First and foremost, I would like to thank the almighty for giving me the strength to overcome both the academic and emotional challenges that I have faced in my pursuit of a doctorate degree. Without his strength I would not have been in this position today. Second, I would like to thank my family for their continued support and for the fun we have when we all get together. A very special thanks to my advisor, Dr. Alin Dobra, not only for his guidance but also for the great camaraderie that we share. I am grateful for having met such an intelligent, creative, full-of-life yet patient and helpful individual. I have thoroughly enjoyed the intense discussions (which others mistook for fights and actually bet on who would win) we have had in this time. I would like to thank Dr. Paul Gader and Dr. Arunava Banerjee for their insightful suggestions and encouragement during difficult times. I would also like to thank my other committee members, Dr. Sanjay Ranka and Dr. Ravindra Ahuja, for their invaluable input. I feel fortunate to have taken courses with Dr. Meera Sitharam and Dr. Anand Rangarajan, who are great teachers and taught me what it means to understand something. Last but definitely not least, I would like to thank my friends and roommates, for without them life would have been dry. A special thanks to Hale, Kartik (or Kartiks should I say), Bhuppi, Ajit, Gnana, Somnath and many others for their support and encouragement. Thanks a lot guys! This would not have been possible without you all.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
    Practical Impact
    Related Work
    Methodology
        What is the Methodology?
        Why have such a Methodology?
        How do I Implement the Methodology?
    Applying the Methodology
        Algorithmic Perspective
        Dataset Perspective
    Research Goals

2 GENERAL FRAMEWORK
    Generalization Error (GE)
    Alternative Methods for Computing the Moments of GE

3 ANALYSIS OF MODEL SELECTION MEASURES
    Hold-out Set Error
    Multifold Cross Validation Error

4 NAIVE BAYES CLASSIFIER, SCALABILITY AND EXTENSIONS
    Example: Naive Bayes Classifier
        Naive Bayes Classifier Model (NBC)
        Computation of the Moments of GE
        Full-Fledged NBC
    Calculation of Basic Probabilities
        Direct Calculation
        Approximation Techniques
            Series approximations (SA)
            Optimization
            Random sampling using formulations (RS)
        Empirical Comparison of Cumulative Distribution Function Computing Methods
        Monte Carlo (MC) vs Random Sampling Using Formulations
    Calculation of Cumulative Joint Probabilities
    Moment Comparison of Test Metrics
        Hold-out Set
        Cross Validation
        Comparison of GE, HE, and CE
    Extension

5 ANALYZING DECISION TREES
    Computing Moments
        Technical Framework
        All Attribute Decision Trees (ATT)
        Decision Trees with Non-trivial Stopping Criteria
            Characterizing "path exists" for Three Stopping Criteria
            Split Attribute Selection
            Random Decision Trees
        Putting Things Together
            Fixed Height
            Purity and Scarcity
    Experiments
    Discussion
        Extension
        Scalability
    Take-aways

6 K-NEAREST NEIGHBOR CLASSIFIER
    Specific Contributions
    Technical Framework
        K-Nearest Neighbor Algorithm
    Computation of Moments
        General Characterization
        Efficient Characterization for Sample Independent Distance Metrics
    Scalability Issues
    Experiments
        General Setup
        Study 1: Performance of the KNN Algorithm for Different Values of k
        Study 2: Convergence of the KNN Algorithm with Increasing Sample Size
        Study 3: Relative Performance of 10-fold Cross Validation on Synthetic Data
        Study 4: Relative Performance of 10-fold Cross Validation on Real Datasets
    Discussion
        Possible Extensions
    Take-aways

7 INSIGHTS INTO CROSS-VALIDATION
    Preliminaries
    Overview of the Customized Expressions
    Related Work
    Experiments
        Variance
        Expected value
        Expected value square + variance
    Take-aways

8 CONCLUSION

APPENDIX: PROOFS
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Notation used throughout the thesis
Contingency table of input X
Naive Bayes notation
Empirical comparison of the cdf computing methods in terms of execution time; RS_n denotes the Random Sampling procedure using n samples to estimate the probabilities
Confidence bounds for Random Sampling
Comparison of methods for computing the cdf
Contingency table with v classes, M input vectors and total sample size $N = \sum_{i=1}^{M}\sum_{j=1}^{v} N_{ij}$

LIST OF FIGURES

4-1 I have two attributes, each having two values, with 2 class labels
The current iterate $\bar{y}_k$ just satisfies the constraint $c_l$ and easily satisfies the other constraints
Estimates of the expected value of GE by MC and RS with increasing training set size N (five figures)
The plot is of the polynomial $z = (x + 10)^4 x^2 y + (y + 10)^4 y^2 x$
HE expectation in a single dimension
HE variance in a single dimension
HE E[] + Std() in a single dimension
HE expectation in multiple dimensions
HE variance in multiple dimensions
HE E[] + Std() in multiple dimensions
Expectation of CE
Individual run variance of CE
Pairwise covariances of CV
Total variance of cross validation
E[] + Var() of CV
Convergence behavior
CE expectation
Individual run variance of CE
4-23 Pairwise covariances of CV
Total variance of cross validation
E[] + Var() of CV
Convergence behavior
The all attribute tree with 3 attributes $A_1, A_2, A_3$, each having 2 values
Given 3 attributes $A_1, A_2, A_3$, the path $m_{11} m_{21} m_{31}$ is formed irrespective of the ordering of the attributes
Fixed height trees with d = 5, h = 3 and attributes with binary splits
Fixed height trees with d = 5, h = 3 and attributes with ternary splits
Fixed height trees with d = 8, h = 3 and attributes with binary splits
Purity based trees with d = 5 and attributes with binary splits
Purity based trees with d = 5 and attributes with ternary splits
Purity based trees with d = 8 and attributes with binary splits
Scarcity based trees with d = 5, pb = N/10 and attributes with binary splits
Scarcity based trees with d = 5, pb = N/10 and attributes with ternary splits
Scarcity based trees with d = 8, pb = N/10 and attributes with binary splits
Comparison between AF and MC on three UCI datasets for trees pruned based on fixed height (h = 3), purity and scarcity (pb = N/10)
b, c and d are the 3 nearest neighbours of a
The figure shows the extent to which a point $x_i$ is near to x
Behavior of the GE for different values of k
Convergence of the GE for different values of k
Comparison between the GE and the 10-fold cross validation error (CE) estimate for different values of k at a given sample size N (two figures)
Comparison between true error (TE) and CE on 2 UCI datasets
Var(HE) for small sample size and low correlation
Var(HE) for small sample size and medium correlation
7-3 Var(HE) for small sample size and high correlation
Var(HE) for larger sample size and low correlation
Var(HE) for larger sample size and medium correlation
Var(HE) for larger sample size and high correlation
Cov(HE_i, HE_j) for small sample size and low correlation
Cov(HE_i, HE_j) for small sample size and medium correlation
Cov(HE_i, HE_j) for small sample size and high correlation
Cov(HE_i, HE_j) for larger sample size and low correlation
Cov(HE_i, HE_j) for larger sample size and medium correlation
Cov(HE_i, HE_j) for larger sample size and high correlation
Var(CE) for small sample size and low correlation
Var(CE) for small sample size and medium correlation
Var(CE) for small sample size and high correlation
Var(CE) for larger sample size and low correlation
Var(CE) for larger sample size and medium correlation
Var(CE) for larger sample size and high correlation
E[CE] for small sample size and low correlation
E[CE] for larger sample size and low correlation
E[CE] for small sample size at medium and high correlation
E²[CE] + Var(CE) for small sample size and low correlation
E²[CE] + Var(CE) for small sample size and medium correlation
E²[CE] + Var(CE) for small sample size and high correlation
E²[CE] + Var(CE) for larger sample size and low correlation
E²[CE] + Var(CE) for larger sample size and medium correlation
E²[CE] + Var(CE) for larger sample size and high correlation
A-1 Instances of possible arrangements

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

SEMI-ANALYTICAL METHOD FOR ANALYZING MODELS AND MODEL SELECTION MEASURES

By Amit Dhurandhar

August 2009

Chair: Alin Dobra
Major: Computer Engineering

Considering the large amounts of data collected every day in various domains such as health care, financial services, astrophysics and many others, there is a pressing need to convert this information into knowledge. Machine learning and data mining are both concerned with achieving this goal in a scalable fashion. The main theme of my work has been to analyze and better understand prevalent classification techniques and paradigms, which are an integral part of machine learning and data mining research, with an aim to reduce the hiatus between theory and practice.

Machine learning and data mining researchers have developed a plethora of classification algorithms to tackle classification problems. Unfortunately, no one algorithm is superior to the others in all scenarios, nor is it entirely clear which algorithm should be preferred over others under specific circumstances. Hence, an important question is: what is the best choice of classification algorithm for a particular application? This problem is termed classification model selection and is a very important problem in machine learning and data mining.

The primary focus of my research has been to propose a novel methodology to study these classification algorithms accurately and efficiently in the non-asymptotic regime. In particular, we propose a moment-based method where, by focusing on the probabilistic space of classifiers induced by the classification algorithm and datasets of size N drawn independently and identically distributed (i.i.d.) from a joint distribution, we obtain efficient characterizations for computing the moments of the generalization error. Moreover, we can also study model selection techniques such as cross-validation, leave-one-out and hold-out set validation in our proposed framework. This is possible since we have also established general relationships between the moments of the generalization error and the moments of the hold-out-set error, cross-validation error and leave-one-out error. Deploying the methodology, we were able to provide interesting explanations for the behavior of cross-validation. The methodology aims at covering the gap between results predicted by theory and the behavior observed in practice.

CHAPTER 1
INTRODUCTION

A significant portion of the work in machine learning is dedicated to designing new learning methods or to better understanding, at a macroscopic level (i.e., performance over various datasets), the known learning methods. The body of work that tries to understand the microscopic behavior (i.e., the essence of the method) of either models or of the methods used to evaluate models is rather small, even though such understanding is, I think, crucial for deepening the understanding of machine learning techniques and results and for establishing solid connections with statistics. The two prevalent approaches to establishing such results are based on either theory or empirical studies, but usually not both, unless empirical studies are used to validate the theory. While both methods are powerful in themselves, each suffers from at least one major deficiency. The theoretical method depends on nice, closed-form formulae, which usually restricts the types of results that can be obtained to asymptotic results or statistical learning theory (SLT) type results (Vapnik [1998]). Should formulae become large and tedious to manipulate, theoretical results become hard to obtain and to use or interpret. The empirical method is well suited for validating intuitions but is significantly less useful for finding novel, interesting results, since a large number of experiments has to be conducted in order to reduce the error to a reasonable level. This is particularly difficult when small probabilities are involved, making empirical evaluation impractical in such cases.

An ideal scenario, from the point of view of producing interesting results, would be to use theory to make as much progress as possible, potentially obtaining uninterpretable formulae, and then to use visualization to understand and find the consequences of such formulae. This would avoid both the limitation of theory to nice formulae and the limitation of empirical studies, namely having to perform large numbers of experiments. The role of the theory would be to significantly reduce the amount of computation required.

The role of visualization would be to make the potentially complicated theoretical formulae understandable. This is precisely what I propose: a new hybrid method to characterize and understand models and model selection measures (i.e., methods that evaluate learning models). The work I present here is an initial foray into what might prove to be a useful tool for studying learning algorithms. I call this method semi-analytical, since not just the formulae, but visualization in conjunction with the formulae, leads to interpretability. What makes such an endeavor possible is the fact that, mostly due to the linearity of expectation, moments of complicated random variables can be computed exactly with efficient formulae, even though deriving the exact distribution in the form of small closed-form formulae is a daunting task.

1.1 Practical Impact

In this section I discuss the impact of the proposed research on industry and on the field of machine learning and data mining in general.

Impact on industry and other fields: In today's day and age, adaptive classification models find applicability in a wide spectrum of applications ranging over various domains. Financial firms deploy these models for security purposes such as fraud detection and intrusion detection. Credit card companies use these models to make credit card offers to people by categorizing them based on their previous transaction history. Giant supermarket chains use these models to figure out which groups of items are generally bought together by customers. These models are used extensively in bioinformatics for problems such as gene classification based on functionality, DNA/protein sequence matching, etc. They also find application in medicine, for the analysis of the importance of clinical parameters and of their combinations, the prediction of disease progression, the extraction of medical knowledge for outcome research, therapy planning and support, and overall patient management. Today's state-of-the-art search engines also use classification models. This is just a snapshot of the entire range of applications they are used for.

Noticing the wide applicability of classification models and the sheer extent of their number, choosing the correct model for a specific application is a highly desirable goal. Through this research, I hope to take a step forward in this direction.

Impact on machine learning and data mining research: I believe that this research will assist in providing new insight into the behavior of classification models and model selection measures. The framework may be used as an exploratory tool for observing and understanding models and selection measures under the specific circumstances that interest the user. Other related problems may also be framed in an analogous fashion, leading to interesting observations and consequent interpretations.

1.2 Related Work

A critical piece of theoretical work that is coherent and provides structure for comparing learning methods is Statistical Learning Theory (SLT) (Vapnik [1998]). SLT categorizes classification algorithms (actually the more general learning algorithms) into different classes called concept classes. The concept class of a classification algorithm is determined by its Vapnik-Chervonenkis (VC) dimension, which is related to the shattering capability of the algorithm. Given a 2-class problem, the shattering capability of a function refers to the maximum number of points that the function can classify without making any errors, for all possible assignments of class labels to the points in some chosen configuration. The shattering capability of an algorithm is the supremum of the shattering capabilities of all the functions it can represent. SLT derives distribution-free bounds on the generalization error (the expected error over the entire input) of a classifier built using a particular classification algorithm belonging to a concept class. The bounds are functions of the VC dimension and the sample size. The strength of this technique is that, by finding the VC dimension of an algorithm, I can derive error bounds for the classifiers built using this algorithm without ever referring to the underlying distribution.
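For concreteness, one commonly quoted form of such a bound (the exact constants vary across presentations, so this should be read as an illustrative instance rather than the precise form used later in this thesis) is: with probability at least $1 - \delta$ over an i.i.d. sample of size $N$, every classifier $f$ in a concept class of VC dimension $d$ satisfies

$$GE(f) \;\leq\; \widehat{err}_N(f) \;+\; \sqrt{\frac{d\left(\ln\frac{2N}{d} + 1\right) + \ln\frac{4}{\delta}}{N}}$$

where $\widehat{err}_N(f)$ is the training error. Nothing on the right-hand side refers to the underlying distribution, which is both the strength of the technique and, as discussed next, the source of its looseness.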

A fallout of this very general characterization is that the bounds are usually loose (Boucheron et al. [2005], Williamson [2001]), which in turn makes statements about any particular classifier weak.

There is a large body of both experimental and theoretical work that addresses the problem of understanding various model selection measures. The model selection measures relevant to our discussion are hold-out-set validation and cross-validation. Shao [1993] showed that asymptotically leave-one-out (LOO) chooses the best but not the simplest model. Devroye et al. [1996] derived distribution-free bounds for cross validation; the bounds they found were for the nearest neighbor model. Breiman [1996] showed that cross validation gives an unbiased estimate of the first moment of the generalization error. Though cross validation has desirable characteristics when estimating the first moment, Breiman stated that its variance can be significant. Theoretical bounds on the LOO error under certain algorithmic stability assumptions were given by Kearns and Ron [1997], who showed that the worst-case error of the LOO estimate is not much worse than the training error estimate. Elisseeff and Pontil [2003] introduced the notion of training stability and showed that even with this weaker notion of stability, good bounds could be obtained on the generalization error. Blum et al. [1999] showed that v-fold cross validation is, in expectation, at least as good as hold-out-set estimation on a hold-out set of size N/v. Kohavi [1995] conducted experiments on Naive Bayes and C4.5 using cross-validation; through his experiments he concluded that 10-fold stratified cross validation should be used for model selection. Moore and Lee [1994] proposed heuristics to speed up cross-validation. Plutowski's [1996] survey included proposals with theoretical results, heuristics and experiments on cross-validation; it was especially geared towards the behavior of cross-validation on neural networks, and he inferred from the previously published results that cross-validation is robust. More recently, Bengio and Grandvalet [2003] proved that there is no universally unbiased estimator of the variance of cross-validation. Zhu and Rohwer [1996] proposed a simple setting in which cross-validation performs poorly.

Goutte [1997] refuted this proposed setting and claimed that a realistic scenario in which cross-validation fails is still an open question.

The work I present here covers the middle ground between these theoretical and empirical results by allowing classifier-specific results based on moment analysis. Such an endeavor is important since the gap between theoretical and empirical results is significant (Langford [2005]). Preliminary work of this nature was done in Braga-Neto and Dougherty [2005], where the authors characterized the discrete histogram rule. However, their analysis does not provide any indication of how other, more popular algorithms can be characterized in a similar fashion while keeping scalability and accuracy in mind. Specific classification schemes such as the W-statistic (Anderson [2003]) have been characterized in the past, but such analysis is very much limited to those and other similar statistics. The methodology I present here may potentially be applicable to a large variety of learning algorithms.

1.3 Methodology

What is the Methodology?

The methodology for studying classification models consists in studying the behavior of the first two central moments of the GE of the classification algorithm under study. The moments are taken over the space of all possible classifiers produced by the classification algorithm when it is trained over all possible datasets sampled independently and identically (i.i.d.) from some distribution. The first two moments give enough information about the statistical behavior of the classification algorithm to allow interesting observations about its behavior and trends. Higher moments may be computed using the same strategy, but might prove inefficient to compute.
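Stated symbolically, in the notation developed in Chapter 2, with $Z(N)$ denoting the space of classifiers obtained by training on i.i.d. datasets of size $N$, the two quantities the methodology tracks are

$$E_{Z(N)}[GE(\zeta)] \qquad\text{and}\qquad Var_{Z(N)}[GE(\zeta)] = E_{Z(N)}\!\left[GE(\zeta)^2\right] - \left(E_{Z(N)}[GE(\zeta)]\right)^2 .$$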

Why have such a Methodology?

The answers to the following questions shed light on why the methodology is necessary if a tight statistical characterization is to be provided for classification algorithms.

1. Why study GE? The biggest danger of learning is overfitting the training data. The main idea in using GE as the measure of success of learning, instead of the empirical error on a given dataset, is to provide a mechanism to avoid this pitfall. Implicitly, by analyzing GE, all of the input is considered.

2. Why study the moments instead of the distribution of GE? Ideally, I would study the distribution of GE instead of the moments in order to get a complete picture of its behavior. Studying the distribution of discrete random variables, except for very simple cases, turns out to be very hard. The difficulty comes from the fact that even computing the pdf at a single point is intractable, since all combinations of random choices that result in the same value of GE have to be enumerated. On the other hand, the first two central moments, coupled with distribution-independent bounds such as Chebyshev and Chernoff, give guarantees about the worst possible behavior that are not too far from the actual behavior (a small constant factor); the Chebyshev guarantee is sketched after this list. Interestingly, it is possible to compute the moments of a random variable like GE without ever explicitly writing or making use of the formula for the pmf/pdf. What makes such an endeavor possible is extensive use of the linearity of expectation, as is explained later.

3. Why characterize a class of classifiers instead of a single classifier? While the use of GE as the success measure is standard practice in machine learning, characterizing classes of classifiers instead of the particular classifier produced on a given dataset is not. From the point of view of the analysis, without large testing datasets it is not possible to evaluate GE directly for a particular classifier. By considering a class of classifiers to which a classifier belongs, an indirect characterization is obtained for the particular classifier. This is precisely what Statistical Learning Theory (SLT) does; there, the class of classifiers consists of all classifiers with the same VC dimension. The main problem with SLT results is that classes based on VC dimension are too large, and thus the results tend to be pessimistic. In the methodology, the class of classifiers consists only of the classifiers that are produced by the given classification algorithm from datasets of a fixed size sampled from the underlying distribution. This is the smallest probabilistic class in which the particular classifier produced on a given dataset can be placed.
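To make the guarantee invoked in question 2 concrete, here is the Chebyshev form of the bound (a standard inequality, restated here in the notation of this thesis): for any $t > 0$,

$$P_{Z(N)}\Big[\,\big|GE(\zeta) - E_{Z(N)}[GE(\zeta)]\big| \geq t\,\Big] \;\leq\; \frac{Var_{Z(N)}[GE(\zeta)]}{t^2} .$$

Once the first two central moments are computed exactly, this immediately bounds the probability that the generalization error of a classifier drawn from $Z(N)$ strays far from its mean.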

How do I Implement the Methodology?

One way of approximately estimating the moments of GE over all possible classifiers for a particular classification algorithm is by directly using Monte Carlo. If I use Monte Carlo directly, I first need to produce a classifier on a sampled dataset and then test it on a number of test sets sampled from the same distribution, acquiring an estimate of the GE of this classifier. Repeating this entire procedure a number of times, I acquire estimates of GE for different classifiers. Then, by averaging the errors of these multiple classifiers, I get an estimate of the first moment of GE. The variance of GE can be estimated similarly.

Another way of estimating the moments of GE is by obtaining parametric expressions for them. If this can be accomplished, the moments can be computed exactly. Moreover, by carefully observing the manner in which the expressions are derived for a particular classification algorithm, insights can be gained into analyzing other algorithms of interest. Though deriving the expressions may be a tedious task, using them I obtain highly accurate estimates of the moments. I propose this second alternative for analyzing models. The key to the analysis is focusing on the learning and inference phases of the algorithm. In cases where the parametric expressions are computationally intensive to compute directly, I show that by approximating individual terms using optimization techniques, and even Monte Carlo, I obtain accurate estimates of the moments when compared to directly using Monte Carlo (the first alternative) at the same computational cost. If the moments are to be studied on synthetic data, the distribution is assumed anyway and the parametric expressions can be used directly. If I have real data, an empirical distribution can be built on the dataset and the parametric expressions can then be used.
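The direct Monte Carlo alternative described above is easy to state in code. The sketch below is a minimal illustration of that baseline, not the method proposed in this thesis; `train` and the dictionary representation of the joint distribution are hypothetical stand-ins for whatever classification algorithm and distribution are under study.

```python
import random

def mc_moments_of_ge(joint, train, N, n_classifiers=200, n_test=5000):
    """Estimate E[GE] and Var(GE) by direct Monte Carlo.

    joint: dict mapping (x, y) pairs to probabilities (must sum to 1).
    train: function taking a list of (x, y) pairs and returning a
           classifier (a function from x to a predicted label).
    """
    pairs = list(joint.keys())
    weights = [joint[p] for p in pairs]

    def sample_dataset(size):
        return random.choices(pairs, weights=weights, k=size)

    ge_samples = []
    for _ in range(n_classifiers):
        clf = train(sample_dataset(N))       # one classifier per dataset
        test = sample_dataset(n_test)        # fresh i.i.d. test sample
        err = sum(clf(x) != y for x, y in test) / n_test
        ge_samples.append(err)               # noisy estimate of GE(clf)

    mean = sum(ge_samples) / len(ge_samples)
    var = sum((g - mean) ** 2 for g in ge_samples) / len(ge_samples)
    return mean, var
```

Every GE value here is estimated from a finite test set, and the moments from a finite number of classifiers; this double layer of sampling error is exactly what the parametric expressions avoid.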

1.4 Applying the Methodology

It is important to note that the methodology is not aimed at providing a way of estimating bounds on the GE of a classifier on a given dataset. The primary goal is creating an avenue in which learning algorithms can be studied precisely, i.e., studying the statistical behavior of a particular algorithm w.r.t. a chosen/built distribution. Below, I discuss the two most important perspectives from which the methodology can be applied.

Algorithmic Perspective

If a researcher/practitioner designs a new classification algorithm, he/she needs to validate it. Standard practice is to validate the algorithm on a relatively small number of datasets (5-20) and to report the performance. By observing the behavior of only a few instances of the algorithm, the designer infers its quality. Moreover, if the algorithm underperforms on some datasets, it can sometimes be difficult to pinpoint the precise reason for its failure. If instead he/she is able to derive parametric expressions for the moments of GE, the test results are more relevant to the particular classification algorithm, since the moments are over all possible datasets of a particular size drawn i.i.d. from some chosen/built distribution. Testing individually on all these datasets is an impossible task. Thus, by computing the moments using the parametric expressions, the algorithm is in effect tested on a plethora of datasets, with the results being highly accurate. Moreover, since the testing is done in a controlled environment, i.e., all the parameters are known to the designer during testing, he/she can precisely pinpoint the conditions under which the algorithm performs well and the conditions under which it underperforms.

Dataset Perspective

If an algorithm designer validates his/her algorithm by computing moments as described above, it can instill greater confidence in a practitioner searching for an appropriate algorithm for his/her dataset. The reason is that if the practitioner has a dataset that has a similar structure, or is from a similar source, as the test dataset on which an empirical distribution was built and favorable results were reported by the designer, then the results apply not only to that particular test dataset but to other datasets of a similar type; since the practitioner's dataset belongs to this collection, the results also apply to it. Note that a distribution is just a weighting of different datasets, and this perspective is used in the above exposition.

If the dataset is categorical, it can be precisely modeled by a multinomial distribution in the following manner. A multinomial is completely characterized by the probabilities in each of its cells (which sum to 1) and the total count N (the sum of the individual cell counts).

The designer can set the number of cells in the multinomial to the number of cells in his contingency table, with the empirical estimate for each cell probability being the corresponding cell count divided by the size of the dataset, which is the value of N. With this, I have a fully specified multinomial distribution, using which I can compute the formulations and consequently characterize the moments of the GE. Since the estimates for the cell probabilities are based on the available dataset, the true underlying distribution, of which this dataset is a sample, may have slightly different values. This scenario can be accounted for by varying the cell probabilities to a desired degree and observing the variation in the estimates of GE; a small sketch of this construction is given after this paragraph. This assists in deciphering the sensitivity of the model in question to noise. In the continuous case there is no such generic distribution (as the multinomial), but a popular choice is a mixture of Gaussians (other distributions could also be used).
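A minimal sketch of this construction, assuming the dataset is a list of categorical input-output pairs; the function names are illustrative, not part of any library.

```python
from collections import Counter

def empirical_multinomial(dataset):
    """Build the fully specified multinomial described above.

    dataset: list of (x, y) pairs, with x a tuple of categorical
             attribute values and y a class label.
    Returns (cell_probs, N): cell probabilities summing to 1 and the
    total count N.
    """
    counts = Counter(dataset)          # one cell per distinct (x, y)
    N = len(dataset)
    cell_probs = {cell: c / N for cell, c in counts.items()}
    return cell_probs, N

def perturb(cell_probs, eps, rng):
    """Vary the cell probabilities slightly (sensitivity analysis)."""
    noisy = {cell: max(p + rng.uniform(-eps, eps), 0.0)
             for cell, p in cell_probs.items()}
    total = sum(noisy.values())
    return {cell: p / total for cell, p in noisy.items()}  # renormalize
```

Feeding perturbed versions of `cell_probs` back into the moment expressions then shows how sensitive the computed moments of GE are to noise in the estimated distribution, as suggested above.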

1.5 Research Goals

In this section I state the specific research goals that I have accomplished in this thesis work.

General Framework: To provide a statistical characterization of classifiers, a probabilistic class of classifiers that contains the desired classifier has to be considered, since the behavior of any particular classifier can be arbitrarily poor. The class considered by statistical learning theory is the class of classifiers with a given VC dimension (Vapnik [1998]). While the results thus obtained are very general, no particularity of the classification algorithm is exploited. The class of classifiers considered in this thesis consists of the classifiers obtained by applying the classification algorithm to a dataset of given size sampled i.i.d. from the underlying distribution. This leads to a different way of characterizing classifiers, based on moment analysis. I develop a framework to analyze classification algorithms.

Analysis of Model Selection Measures: To relate the moments of the generalization error (GE) to the moments of the cross-validation error (CE), leave-one-out error (LE) and hold-out-set error (HE). This will assist us in studying the behavior of these errors given the moments of any one of them.

Analysis of Specific Classification Models: To develop customized formulations of the moments for specific classification algorithms. This will aid in studying classification algorithms in conjunction with the selection measures. I choose the following models, which are a mix of parametric and non-parametric models.

1. Naive Bayes Classifier (NBC) model: NBC is a model that is extensively used in industry, due to its robustness outperforming its more sophisticated counterparts in many real-world applications (e.g., spam filtering in Mozilla Thunderbird and Microsoft Outlook, bioinformatics, etc.). There has been work on the robustness of NBC (Domingos and Pazzani [1997], Rish [2001]), but the proposed framework and the inter-relationships between the moments of the various errors help us to extensively study not just the model but also the behavior of the validation methods in conjunction with it.

2. Decision Tree (DT) model: Decision trees are also extensively used in data mining and machine learning applications. Besides performance, they are sometimes preferred over other models (e.g., support vector machines, neural nets) because the process by which the eventual classifier is built from the sample is transparent. The probabilistic formulations will incorporate various pruning conditions such as purity, scarcity and fixed height. The formulations will help in better understanding the behavior of these trees for classification.

3. K-Nearest-Neighbor (KNN) classifier model: This model is one of the simpler models, yet it is highly effective. Theoretical results exist (Stone [1977]) regarding the convergence of the generalization error (GE) of this algorithm to the Bayes error (the best possible performance). However, this result is asymptotic, and for finite sample sizes in real scenarios, finding the optimal value of K is more of an art than a science. The proposed methodology can be used to study the algorithm accurately for different values of K and for different distance metrics in controlled settings.

Scalability: To make the computation of the moments scalable. This is especially relevant when the domain is discrete and the computation of individual probabilities becomes computationally intensive. In these cases I have to come up with approximation techniques that are accurate and fast, making the analysis practical.

Practical Study of the Non-asymptotic Behavior of NBC, DT, KNN and Selection Measures: The formulas for the moments of GE, and consequently of HE, CE and LE, for the NBC, DT and KNN that are derived using the general framework can be used to carry out an extensive study of the behavior of these classification algorithms in conjunction with the model selection measures.

I have carried out such a comparison with the aim of identifying interesting trends about the mentioned classification algorithms and the model selection measures, so as to exemplify the utility of the theoretical framework.

CHAPTER 2
GENERAL FRAMEWORK

Probability distributions completely characterize the behavior of a random variable. Moments of a random variable give us information about its probability distribution. Thus, if I have knowledge of the moments of a random variable, I can make statements about its behavior. In some cases, characterizing a finite subset of moments may prove to be a more desirable alternative than characterizing the entire distribution, which can be wild and computationally expensive to compute. This is precisely what I do when I study the behavior of the generalization error of a classifier and of the error estimation methods, viz. hold-out error, leave-one-out error and cross-validation error. Characterizing the distribution, though possible, can turn out to be a tedious task, and studying the moments instead is a more viable option.

As a result, I employ moment analysis and use linearity of expectation to explore the relationship between various estimates for the error of classifiers: generalization error (GE), hold-out-set error (HE) and cross-validation error (CE); leave-one-out error is just a particular case of CE, and I do not analyze it independently. The relationships are drawn by going over the space of all possible datasets. The actual computation of moments, though, is conducted by going over the space of classifiers induced by a particular classification algorithm and i.i.d. data, since this leads to computational efficiency. I interchangeably go over the space of datasets and the space of classifiers as deemed appropriate, since the classification algorithm is assumed to be deterministic. That is, I have

$$E_{D(N)}[F(\zeta[D(N)])] = E_{Z(N)}[F(\zeta)] = E_{x_1 \ldots x_N}[F(\zeta(x_1, x_2, \ldots, x_N))]$$

where F() is some function that operates on a classifier. I also consider the learning algorithms to be symmetric (the algorithm is oblivious to random permutations of the samples in the training dataset). Throughout this section and in the rest of the thesis I use the notation in Table 2-1 unless stated otherwise.

2.1 Generalization Error (GE)

The notion of generalization error is defined with respect to an underlying probability distribution over the input-output space and a loss function (error metric). I model this probability space with the random vector X for the input and the random variable Y for the output. When the input is fixed, Y(x) is the random variable that models the output. (By modeling the output for a given input as a random variable, I allow the output to be randomized, as it might be in most real circumstances.) I assume in this thesis that the domain $\mathcal{X}$ of X is discrete; all the theory can be extended to the continuous case essentially by replacing the counting measure with the Lebesgue measure and sums with integrals. Whenever the probability and expectation are with respect to the probabilistic space (X, Y) that models the problem, I will not use any index. For other probabilistic spaces, I will specify by an index which probability space I refer to. I denote the error metric by λ(a, b); in this thesis I use only the 0-1 metric, which takes value 1 if $a \neq b$ and 0 otherwise. With this, the generalization error of a classifier ζ is:

$$GE(\zeta) = E[\lambda(\zeta(X), Y)] = P[\zeta(X) \neq Y] = \sum_{x \in \mathcal{X}} P[X = x] \; P[\zeta(x) \neq Y(x)] \qquad (2\text{-}1)$$

where I used the fact that, for the 0-1 loss function, the expectation is the probability that the prediction is erroneous. Notice that the notation using Y(x) is really a conditional on X = x. I use this notation since it is intuitive and more compact. The last form of the generalization error is the most useful in this thesis, since it decomposes a global measure defined over the entire space into micro measures, one for each input.
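The decomposition in Equation 2-1 translates directly into code. The following is a small sketch for a discrete space, assuming the distribution is available as dictionaries; the names are illustrative.

```python
def generalization_error(clf, p_x, p_y_given_x):
    """Exact GE of a fixed classifier under a known discrete distribution.

    clf: function mapping an input x to a predicted label.
    p_x: dict mapping each input x to P[X = x].
    p_y_given_x: dict mapping x to a dict {y: P[Y(x) = y]}.
    """
    ge = 0.0
    for x, px in p_x.items():
        pred = clf(x)
        # P[zeta(x) != Y(x)] = 1 - P[Y(x) = pred]
        ge += px * (1.0 - p_y_given_x[x].get(pred, 0.0))
    return ge
```

The loop mirrors the decomposition exactly: one term per input x, weighted by P[X = x].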

By carefully selecting the class of classifiers for which the moment analysis of the generalization error is performed, meaningful and relevant probabilistic statements can be made about the generalization error of a particular classifier from this class. The probability distribution over the classifiers is based on the randomness of the data used to produce the classifier. To formalize this, let Z(N) be the class of classifiers built over a dataset of size N, with a probability space defined over it. With this, the k-th moment around 0 of the generalization error is:

$$E_{Z(N)}\left[GE(\zeta)^k\right] = \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta] \; GE(\zeta)^k$$

The problem with this definition is that it talks about a global characterization of classifiers, which can be hard to capture. I rewrite the formulae for the first and second moments in terms of the fine-granularity structure of the classifiers. While deriving these moments, I have to consider double expectations of the form $E_{Z(N)}[E[F(x, \zeta)]]$, with F(x, ζ) a function that depends both on the input x and on the classifier. With this I arrive at the following result:

$$E_{Z(N)}[E[F(x, \zeta)]] = \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta] \sum_{x \in \mathcal{X}} P[X = x] \; F(x, \zeta) = \sum_{x \in \mathcal{X}} P[X = x] \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta] \; F(x, \zeta) = E\left[E_{Z(N)}[F(x, \zeta)]\right] \qquad (2\text{-}2)$$

which uses the fact that P[X = x] does not depend on a particular ζ and $P_{Z(N)}[\zeta]$ does not depend on a particular x, even though both quantities depend on the underlying probability distribution. Using the definition of the moments above, Equation 2-1 and Equation 2-2, I have the following theorem.

Theorem 1. The first and second moments of GE are given by

$$E_{Z(N)}[GE(\zeta)] = \sum_{x \in \mathcal{X}} P[X = x] \sum_{y \in \mathcal{Y}} P_{Z(N)}[\zeta(x) = y] \; P[Y(x) \neq y]$$

and

$$E_{Z(N) \times Z(N)}[GE(\zeta)\,GE(\zeta')] = \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} P[X = x] \; P[X = x'] \sum_{y \in \mathcal{Y}} \sum_{y' \in \mathcal{Y}} P_{Z(N) \times Z(N)}[\zeta(x) = y \wedge \zeta'(x') = y'] \; P[Y(x) \neq y] \; P[Y(x') \neq y']$$

Proof. For the first moment,

$$\begin{aligned}
E_{Z(N)}[GE(\zeta)] &= E_{Z(N)}[E[\lambda(\zeta(X), Y)]] = E\left[E_{Z(N)}[\lambda(\zeta(X), Y)]\right] \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta] \; P[\zeta(x) \neq Y(x) \mid \zeta] \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{\zeta \in Z(N)} \sum_{y \in \mathcal{Y}} P_{Z(N)}[\zeta] \; P[\zeta(x) = y, Y(x) \neq y \mid \zeta] \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{y \in \mathcal{Y}} \sum_{\zeta \in Z(N):\, \zeta(x) = y} P_{Z(N)}[\zeta] \; P[Y(x) \neq y] \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{y \in \mathcal{Y}} P_{Z(N)}[\zeta(x) = y] \; P[Y(x) \neq y]
\end{aligned}$$

For the second moment,

$$\begin{aligned}
E_{Z(N) \times Z(N)}[GE(\zeta)\,GE(\zeta')] &= E_{Z(N) \times Z(N)}\left[E[\lambda(\zeta(X), Y)] \; E[\lambda(\zeta'(X), Y)]\right] \\
&= \sum_{(\zeta, \zeta') \in Z(N) \times Z(N)} P_{Z(N) \times Z(N)}[\zeta, \zeta'] \left( \sum_{x \in \mathcal{X}} P[X = x] \; P[\zeta(x) \neq Y(x)] \right) \left( \sum_{x' \in \mathcal{X}} P[X = x'] \; P[\zeta'(x') \neq Y(x')] \right)
\end{aligned}$$

$$\begin{aligned}
&= \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} P[X = x] \; P[X = x'] \sum_{(\zeta, \zeta') \in Z(N) \times Z(N)} P_{Z(N) \times Z(N)}[\zeta, \zeta'] \; P[\zeta(x) \neq Y(x)] \; P[\zeta'(x') \neq Y(x')] \\
&= \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} P[X = x] \; P[X = x'] \sum_{y \in \mathcal{Y}} \sum_{y' \in \mathcal{Y}} P_{Z(N) \times Z(N)}[\zeta(x) = y \wedge \zeta'(x') = y'] \; P[Y(x) \neq y] \; P[Y(x') \neq y']
\end{aligned}$$

In both series of equations, I made the transition from a summation over the class of classifiers to a summation over the possible outputs, since the focus changed from the classifier to the prediction of the classifier for a specific input (x is fixed inside the first summation). What this effectively does is allow the computation of the moments using only local information (behavior on particular inputs) rather than global information (behavior on all inputs). This speeds up the process of computing the moments.
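As a sketch of how Theorem 1 is used in practice: given the probabilities $P_{Z(N)}[\zeta(x) = y]$ for each input (however the analysis of the particular algorithm supplies them), the first moment is a double sum. The code below assumes those probabilities are precomputed; nothing about a specific classifier is implied.

```python
def first_moment_of_ge(p_x, p_y_given_x, p_pred):
    """E_{Z(N)}[GE] via Theorem 1.

    p_pred: dict mapping x to a dict {y: P_{Z(N)}[zeta(x) = y]},
            i.e., the probability over training sets of size N that
            the learned classifier predicts y on input x.
    """
    moment = 0.0
    for x, px in p_x.items():
        for y, p_zeta in p_pred[x].items():
            moment += px * p_zeta * (1.0 - p_y_given_x[x].get(y, 0.0))
    return moment
```

The cost is linear in the size of the input-output space times whatever it costs to obtain `p_pred`, which is the complexity claim made for the first moment in the next section.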

2.2 Alternative Methods for Computing the Moments of GE

The method I introduced above for computing the moments of the generalization error is based on decomposing the moment into contributions of individual input-output pairs. With such a decomposition, not only does the analysis become simpler, but the complexity of the required algorithm is reduced. In particular, the complexity of computing the first moment is proportional to the size of the input-output space and to the complexity of estimating probabilities of the form $P_{Z(N)}[\zeta(x) = y]$. The complexity of the second moment is quadratic in the size of the input-output space and proportional to the complexity of estimating $P_{Z(N) \times Z(N)}[\zeta(x) = y \wedge \zeta'(x') = y']$. To see the advantage of this method, I compare it with the two other alternatives for computing the moments: definition-based computation and Monte Carlo simulation.

Definition-based computation uses the definition of expectation. It consists in summing, over all possible datasets, the generalization error of the classifier built from each dataset multiplied by the probability of obtaining that dataset as an i.i.d. sample from the underlying probability distribution. Formally,

$$E_{D(N)}[GE(\zeta)] = \sum_{D \in \mathcal{D}(N)} P[D] \; GE(\zeta[D]) \qquad (2\text{-}3)$$

where $\mathcal{D}(N)$ is the set of all possible datasets of size N. The number of possible datasets is exponential in N, with the base of the exponent proportional to the size of the input-output space (the product of the sizes of the domains of the inputs and outputs). Evaluating the moments in this manner is impractical for all but very small spaces and dataset sizes.

Monte Carlo simulation is a simple way to estimate the moments that consists in performing experiments to produce samples that determine the value of the generalization error. In this case, to estimate $E_{D(N)}[GE(\zeta)]$, datasets of size N have to be generated, one for each sample desired. For each of these datasets a classifier has to be constructed according to the classifier construction algorithm. For each classifier produced, samples from the underlying probability distribution have to be generated in order to estimate the generalization error of that classifier. Especially for second moments, the number of samples required to obtain reasonable accuracy for the moments is large. If a study has to be conducted to determine the influence of various parameters of the data generation model, the overall number of experiments that have to be performed becomes infeasible.

In summary, the advantages of the method I propose for estimating the moments are: (a) the formulations are exact, (b) it needs only the local behavior of the classifier, (c) the time complexity is thus reduced, and (d) it does not depend on some of the probabilities being small. I will use this method to compute the moments of the generalization error for the NBC, DT and KNN algorithms.
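To make the infeasibility of definition-based computation concrete, here is a brute-force sketch of Equation 2-3 for a toy space; `train` is again a hypothetical stand-in, and `generalization_error` is the sketch given after Equation 2-1. Since the learner is assumed symmetric, datasets can be enumerated as multisets, each weighted by the number of orderings that realize it.

```python
from itertools import combinations_with_replacement
from math import prod, factorial

def exact_first_moment_by_definition(joint, train, p_x, p_y_given_x, N):
    """E_{D(N)}[GE] by enumerating every dataset of size N (Eq. 2-3)."""
    pairs = list(joint.keys())
    total = 0.0
    for multiset in combinations_with_replacement(pairs, N):
        counts = {p: multiset.count(p) for p in set(multiset)}
        # probability of one ordered sequence with these counts
        prob = prod(joint[p] ** c for p, c in counts.items())
        # number of ordered sequences realizing this multiset
        arrangements = factorial(N)
        for c in counts.values():
            arrangements //= factorial(c)
        clf = train(list(multiset))   # symmetric learner: order irrelevant
        total += arrangements * prob * generalization_error(
            clf, p_x, p_y_given_x)
    return total
```

The number of multisets enumerated is $\binom{|\mathcal{X} \times \mathcal{Y}| + N - 1}{N}$; even with an input-output space of 10 cells and N = 20, this is roughly $10^7$ multisets, while the decomposition of Theorem 1 avoids the enumeration entirely.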

Table 2-1. Notation used throughout the thesis.

Symbol            Meaning
X                 Random vector modeling the input
$\mathcal{X}$     Domain of the random vector X (input space)
Y                 Random variable modeling the output
Y(x)              Random variable modeling the output for input x
$\mathcal{Y}$     Set of class labels (output space)
D                 Dataset
(x, y)            Data point from dataset D
$D_t$             Training dataset
$D_s$             Testing dataset
$D_i$             The i-th part/fold of D (for cross validation)
N                 Size of dataset
$N_t$             Size of training dataset
$N_s$             Size of testing dataset
v                 Number of folds of cross validation
ζ                 Classifier
ζ[D]              Classifier built from dataset D
GE(ζ)             Generalization error of classifier ζ
HE(ζ)             Hold-out-set error of classifier ζ
CE(ζ)             Cross validation error of classifier ζ
Z(S)              The set of classifiers obtained by applying the classification algorithm to an i.i.d. sample of size S
D(S)              Dataset of size S
$E_{Z(S)}[\cdot]$ Expectation w.r.t. the space of classifiers built on a sample of size S


More information

Classifier Complexity and Support Vector Classifiers

Classifier Complexity and Support Vector Classifiers Classifier Complexity and Support Vector Classifiers Feature 2 6 4 2 0 2 4 6 8 RBF kernel 10 10 8 6 4 2 0 2 4 6 Feature 1 David M.J. Tax Pattern Recognition Laboratory Delft University of Technology D.M.J.Tax@tudelft.nl

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

ECE 5424: Introduction to Machine Learning

ECE 5424: Introduction to Machine Learning ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

An Empirical Study of Building Compact Ensembles

An Empirical Study of Building Compact Ensembles An Empirical Study of Building Compact Ensembles Huan Liu, Amit Mandvikar, and Jigar Mody Computer Science & Engineering Arizona State University Tempe, AZ 85281 {huan.liu,amitm,jigar.mody}@asu.edu Abstract.

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Machine Learning! in just a few minutes. Jan Peters Gerhard Neumann

Machine Learning! in just a few minutes. Jan Peters Gerhard Neumann Machine Learning! in just a few minutes Jan Peters Gerhard Neumann 1 Purpose of this Lecture Foundations of machine learning tools for robotics We focus on regression methods and general principles Often

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI An Introduction to Statistical Theory of Learning Nakul Verma Janelia, HHMI Towards formalizing learning What does it mean to learn a concept? Gain knowledge or experience of the concept. The basic process

More information

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20.

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20. 10-601 Machine Learning, Midterm Exam: Spring 2008 Please put your name on this cover sheet If you need more room to work out your answer to a question, use the back of the page and clearly mark on the

More information

BAYESIAN CLASSIFICATION OF HIGH DIMENSIONAL DATA WITH GAUSSIAN PROCESS USING DIFFERENT KERNELS

BAYESIAN CLASSIFICATION OF HIGH DIMENSIONAL DATA WITH GAUSSIAN PROCESS USING DIFFERENT KERNELS BAYESIAN CLASSIFICATION OF HIGH DIMENSIONAL DATA WITH GAUSSIAN PROCESS USING DIFFERENT KERNELS Oloyede I. Department of Statistics, University of Ilorin, Ilorin, Nigeria Corresponding Author: Oloyede I.,

More information

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag Decision Trees Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Supervised Learning Input: labelled training data i.e., data plus desired output Assumption:

More information

Final Exam, Fall 2002

Final Exam, Fall 2002 15-781 Final Exam, Fall 22 1. Write your name and your andrew email address below. Name: Andrew ID: 2. There should be 17 pages in this exam (excluding this cover sheet). 3. If you need more room to work

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University August 30, 2017 Today: Decision trees Overfitting The Big Picture Coming soon Probabilistic learning MLE,

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

IMBALANCED DATA. Phishing. Admin 9/30/13. Assignment 3: - how did it go? - do the experiments help? Assignment 4. Course feedback

IMBALANCED DATA. Phishing. Admin 9/30/13. Assignment 3: - how did it go? - do the experiments help? Assignment 4. Course feedback 9/3/3 Admin Assignment 3: - how did it go? - do the experiments help? Assignment 4 IMBALANCED DATA Course feedback David Kauchak CS 45 Fall 3 Phishing 9/3/3 Setup Imbalanced data. for hour, google collects

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018 Data Mining CS57300 Purdue University Bruno Ribeiro February 8, 2018 Decision trees Why Trees? interpretable/intuitive, popular in medical applications because they mimic the way a doctor thinks model

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

ML in Practice: CMSC 422 Slides adapted from Prof. CARPUAT and Prof. Roth

ML in Practice: CMSC 422 Slides adapted from Prof. CARPUAT and Prof. Roth ML in Practice: CMSC 422 Slides adapted from Prof. CARPUAT and Prof. Roth N-fold cross validation Instead of a single test-training split: train test Split data into N equal-sized parts Train and test

More information

Structure learning in human causal induction

Structure learning in human causal induction Structure learning in human causal induction Joshua B. Tenenbaum & Thomas L. Griffiths Department of Psychology Stanford University, Stanford, CA 94305 jbt,gruffydd @psych.stanford.edu Abstract We use

More information

VC-dimension for characterizing classifiers

VC-dimension for characterizing classifiers VC-dimension for characterizing classifiers Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to

More information

PAC Generalization Bounds for Co-training

PAC Generalization Bounds for Co-training PAC Generalization Bounds for Co-training Sanjoy Dasgupta AT&T Labs Research dasgupta@research.att.com Michael L. Littman AT&T Labs Research mlittman@research.att.com David McAllester AT&T Labs Research

More information

Bayesian decision theory. Nuno Vasconcelos ECE Department, UCSD

Bayesian decision theory. Nuno Vasconcelos ECE Department, UCSD Bayesian decision theory Nuno Vasconcelos ECE Department, UCSD Notation the notation in DHS is quite sloppy e.g. show that ( error = ( error z ( z dz really not clear what this means we will use the following

More information

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

VC-dimension for characterizing classifiers

VC-dimension for characterizing classifiers VC-dimension for characterizing classifiers Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Ensembles Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne

More information

On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality. Weiqiang Dong

On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality. Weiqiang Dong On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality Weiqiang Dong 1 The goal of the work presented here is to illustrate that classification error responds to error in the target probability estimates

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8. Instructor: Haipeng Luo Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

Introduction to Bayesian Learning. Machine Learning Fall 2018

Introduction to Bayesian Learning. Machine Learning Fall 2018 Introduction to Bayesian Learning Machine Learning Fall 2018 1 What we have seen so far What does it mean to learn? Mistake-driven learning Learning by counting (and bounding) number of mistakes PAC learnability

More information

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction 15-0: Learning vs. Deduction Artificial Intelligence Programming Bayesian Learning Chris Brooks Department of Computer Science University of San Francisco So far, we ve seen two types of reasoning: Deductive

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata Principles of Pattern Recognition C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata e-mail: murthy@isical.ac.in Pattern Recognition Measurement Space > Feature Space >Decision

More information

The Bayes classifier

The Bayes classifier The Bayes classifier Consider where is a random vector in is a random variable (depending on ) Let be a classifier with probability of error/risk given by The Bayes classifier (denoted ) is the optimal

More information

Machine Learning (CS 567) Lecture 2

Machine Learning (CS 567) Lecture 2 Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem

More information

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Naïve Bayes Matt Gormley Lecture 18 Oct. 31, 2018 1 Reminders Homework 6: PAC Learning

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

Introduction to ML. Two examples of Learners: Naïve Bayesian Classifiers Decision Trees

Introduction to ML. Two examples of Learners: Naïve Bayesian Classifiers Decision Trees Introduction to ML Two examples of Learners: Naïve Bayesian Classifiers Decision Trees Why Bayesian learning? Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical

More information

Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

More information

Lecture 9: Bayesian Learning

Lecture 9: Bayesian Learning Lecture 9: Bayesian Learning Cognitive Systems II - Machine Learning Part II: Special Aspects of Concept Learning Bayes Theorem, MAL / ML hypotheses, Brute-force MAP LEARNING, MDL principle, Bayes Optimal

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Lossless Online Bayesian Bagging

Lossless Online Bayesian Bagging Lossless Online Bayesian Bagging Herbert K. H. Lee ISDS Duke University Box 90251 Durham, NC 27708 herbie@isds.duke.edu Merlise A. Clyde ISDS Duke University Box 90251 Durham, NC 27708 clyde@isds.duke.edu

More information

Bias Correction in Classification Tree Construction ICML 2001

Bias Correction in Classification Tree Construction ICML 2001 Bias Correction in Classification Tree Construction ICML 21 Alin Dobra Johannes Gehrke Department of Computer Science Cornell University December 15, 21 Classification Tree Construction Outlook Temp. Humidity

More information