SEMI-ANALYTICAL METHOD FOR ANALYZING MODELS AND MODEL SELECTION MEASURES


SEMI-ANALYTICAL METHOD FOR ANALYZING MODELS AND MODEL SELECTION MEASURES

By

AMIT DHURANDHAR

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2009

© 2009 Amit Dhurandhar

To my family, friends and professors

ACKNOWLEDGMENTS

First and foremost, I would like to thank the almighty for giving me the strength to overcome both the academic and emotional challenges that I have faced in my pursuit of a doctorate degree. Without his strength I would not have been in this position today. Second, I would like to thank my family for their continued support and for the fun we have when we all get together. A very special thanks to my advisor, Dr. Alin Dobra, not only for his guidance but also for the great camaraderie that we share. I am grateful for having met such an intelligent, creative, full-of-life yet patient and helpful individual. I have thoroughly enjoyed the intense discussions (which others mistook for fights and actually bet on who would win) we have had in this time. I would like to thank Dr. Paul Gader and Dr. Arunava Banerjee for their insightful suggestions and encouragement during difficult times. I would also like to thank my other committee members, Dr. Sanjay Ranka and Dr. Ravindra Ahuja, for their invaluable input. I feel fortunate to have taken courses with Dr. Meera Sitharam and Dr. Anand Rangarajan, who are great teachers and taught me what it means to understand something. Last but definitely not least, I would like to thank my friends and roommates, for without them life would have been dry. A special thanks to Hale, Kartik (or Kartiks should I say), Bhuppi, Ajit, Gnana, Somnath and many others for their support and encouragement. Thanks a lot guys! This would not have been possible without you all.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
    Practical Impact
    Related Work
    Methodology
        What is the Methodology?
        Why have such a Methodology?
        How do I Implement the Methodology?
    Applying the Methodology
        Algorithmic Perspective
        Dataset Perspective
    Research Goals

2 GENERAL FRAMEWORK
    Generalization Error (GE)
    Alternative Methods for Computing the Moments of GE

3 ANALYSIS OF MODEL SELECTION MEASURES
    Hold-out Set Error
    Multifold Cross Validation Error

4 NAIVE BAYES CLASSIFIER, SCALABILITY AND EXTENSIONS
    Example: Naive Bayes Classifier
        Naive Bayes Classifier Model (NBC)
        Computation of the Moments of GE
        Full-Fledged NBC
    Calculation of Basic Probabilities
        Direct Calculation
        Approximation Techniques
            Series approximations (SA)
            Optimization
            Random sampling using formulations (RS)
        Empirical Comparison of Cumulative Distribution Function Computing Methods
        Monte Carlo (MC) vs Random Sampling Using Formulations
    Calculation of Cumulative Joint Probabilities
    Moment Comparison of Test Metrics
        Hold-out Set
        Cross Validation
        Comparison of GE, HE, and CE
    Extension

5 ANALYZING DECISION TREES
    Computing Moments
        Technical Framework
        All Attribute Decision Trees (ATT)
        Decision Trees with Non-trivial Stopping Criteria
            Characterizing "path exists" for Three Stopping Criteria
            Split Attribute Selection
            Random Decision Trees
        Putting Things Together
            Fixed Height
            Purity and Scarcity
    Experiments
    Discussion
        Extension
        Scalability
    Take-aways

6 K-NEAREST NEIGHBOR CLASSIFIER
    Specific Contributions
    Technical Framework
        K-Nearest Neighbor Algorithm
    Computation of Moments
        General Characterization
        Efficient Characterization for Sample Independent Distance Metrics
    Scalability Issues
    Experiments
        General Setup
        Study 1: Performance of the KNN Algorithm for Different Values of k
        Study 2: Convergence of the KNN Algorithm with Increasing Sample Size
        Study 3: Relative Performance of 10-fold Cross Validation on Synthetic Data
        Study 4: Relative Performance of 10-fold Cross Validation on Real Datasets
    Discussion
        Possible Extensions
    Take-aways

7 INSIGHTS INTO CROSS-VALIDATION
    Preliminaries
    Overview of the Customized Expressions
    Related Work
    Experiments
        Variance
        Expected value
        Expected value square + variance
    Take-aways

8 CONCLUSION

APPENDIX: PROOFS
REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

2-1 Notation used throughout the thesis
Contingency table of input X
Naive Bayes notation
Empirical comparison of the cdf computing methods in terms of execution time; RS_n denotes the Random Sampling procedure using n samples to estimate the probabilities
Confidence bounds for Random Sampling
Comparison of methods for computing the cdf
Contingency table with v classes, M input vectors and total sample size $N = \sum_{i=1}^{M}\sum_{j=1}^{v} N_{ij}$

LIST OF FIGURES

4-1 I have two attributes, each having two values, with 2 class labels
The current iterate $\bar{y}_k$ just satisfies the constraint $c_l$ and easily satisfies the other constraints
Estimates of the expected value of GE by MC and RS with increasing training set size N (five figures)
The plot is of the polynomial $z = (x + 10)^4 x^2 y + (y + 10)^4 y^2 x$
HE expectation in a single dimension
HE variance in a single dimension
HE E[] + Std() in a single dimension
HE expectation in multiple dimensions
HE variance in multiple dimensions
HE E[] + Std() in multiple dimensions
Expectation of CE
Individual run variance of CE
Pairwise covariances of CV
Total variance of cross validation
E[] + Var() of CV
Convergence behavior
CE expectation
Individual run variance of CE
4-23 Pairwise covariances of CV
Total variance of cross validation
E[] + Var() of CV
Convergence behavior
The all attribute tree with 3 attributes $A_1, A_2, A_3$, each having 2 values
Given 3 attributes $A_1, A_2, A_3$, the path $m_{11} m_{21} m_{31}$ is formed irrespective of the ordering of the attributes
Fixed height trees with d = 5, h = 3 and attributes with binary splits
Fixed height trees with d = 5, h = 3 and attributes with ternary splits
Fixed height trees with d = 8, h = 3 and attributes with binary splits
Purity based trees with d = 5 and attributes with binary splits
Purity based trees with d = 5 and attributes with ternary splits
Purity based trees with d = 8 and attributes with binary splits
Scarcity based trees with d = 5, pb = N/10 and attributes with binary splits
Scarcity based trees with d = 5, pb = N/10 and attributes with ternary splits
Scarcity based trees with d = 8, pb = N/10 and attributes with binary splits
Comparison between AF and MC on three UCI datasets for trees pruned based on fixed height (h = 3), purity and scarcity (pb = N/10)
b, c and d are the 3 nearest neighbours of a
The figure shows the extent to which a point $x_i$ is near to x
Behavior of the GE for different values of k
Convergence of the GE for different values of k
Comparison between the GE and the 10-fold cross validation error (CE) estimate for different values of k at a given sample size N (two figures)
Comparison between true error (TE) and CE on 2 UCI datasets
Var(HE) for small sample size and low correlation
Var(HE) for small sample size and medium correlation
7-3 Var(HE) for small sample size and high correlation
Var(HE) for larger sample size and low correlation
Var(HE) for larger sample size and medium correlation
Var(HE) for larger sample size and high correlation
Cov(HE_i, HE_j) for small sample size and low correlation
Cov(HE_i, HE_j) for small sample size and medium correlation
Cov(HE_i, HE_j) for small sample size and high correlation
Cov(HE_i, HE_j) for larger sample size and low correlation
Cov(HE_i, HE_j) for larger sample size and medium correlation
Cov(HE_i, HE_j) for larger sample size and high correlation
Var(CE) for small sample size and low correlation
Var(CE) for small sample size and medium correlation
Var(CE) for small sample size and high correlation
Var(CE) for larger sample size and low correlation
Var(CE) for larger sample size and medium correlation
Var(CE) for larger sample size and high correlation
E[CE] for small sample size and low correlation
E[CE] for larger sample size and low correlation
E[CE] for small sample size at medium and high correlation
E²[CE] + Var(CE) for small sample size and low correlation
E²[CE] + Var(CE) for small sample size and medium correlation
E²[CE] + Var(CE) for small sample size and high correlation
E²[CE] + Var(CE) for larger sample size and low correlation
E²[CE] + Var(CE) for larger sample size and medium correlation
E²[CE] + Var(CE) for larger sample size and high correlation
A-1 Instances of possible arrangements

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

SEMI-ANALYTICAL METHOD FOR ANALYZING MODELS AND MODEL SELECTION MEASURES

By Amit Dhurandhar

August 2009

Chair: Alin Dobra
Major: Computer Engineering

Considering the large amounts of data collected every day in various domains such as health care, financial services, astrophysics and many others, there is a pressing need to convert this information into knowledge. Machine learning and data mining are both concerned with achieving this goal in a scalable fashion. The main theme of my work has been to analyze and better understand prevalent classification techniques and paradigms, which are an integral part of machine learning and data mining research, with an aim to reduce the hiatus between theory and practice.

Machine learning and data mining researchers have developed a plethora of classification algorithms to tackle classification problems. Unfortunately, no one algorithm is superior to the others in all scenarios, nor is it entirely clear which algorithm should be preferred over others under specific circumstances. Hence, an important question is: what is the best choice of classification algorithm for a particular application? This problem is termed classification model selection and is a very important problem in machine learning and data mining.

The primary focus of my research has been to propose a novel methodology to study these classification algorithms accurately and efficiently in the non-asymptotic regime. In particular, we propose a moment-based method where, by focusing on the probabilistic space of classifiers induced by the classification algorithm and datasets of size N drawn independently and identically distributed (i.i.d.) from a joint distribution, we obtain efficient characterizations for computing the moments of the generalization error. Moreover, we can also study model selection techniques such as cross-validation, leave-one-out and hold-out set validation in our proposed framework. This is possible since we have also established general relationships between the moments of the generalization error and the moments of the hold-out-set error, cross-validation error and leave-one-out error. Deploying the methodology, we were able to provide interesting explanations for the behavior of cross-validation. The methodology aims at covering the gap between results predicted by theory and the behavior observed in practice.

CHAPTER 1
INTRODUCTION

A significant portion of the work in machine learning is dedicated to designing new learning methods or to better understanding, at a macroscopic level (i.e., performance over various datasets), the known learning methods. The body of work that tries to understand the microscopic behavior (i.e., the essence of the method) of either models or of the methods used to evaluate models is rather small, even though such understanding is, I think, crucial for deepening the understanding of machine learning techniques and results and for establishing solid connections with statistics. The two prevalent approaches to establishing such results are based on either theory or empirical studies, but usually not both, unless empirical studies are used to validate the theory. While both methods are powerful in themselves, each suffers from at least one major deficiency. The theoretical method depends on nice, closed-form formulae, which usually restricts the types of results that can be obtained to asymptotic results or statistical learning theory (SLT) type results (Vapnik [1998]). Should formulae become large and tedious to manipulate, theoretical results become hard to obtain and to use or interpret. The empirical method is well suited for validating intuitions but is significantly less useful for finding novel, interesting results, since a large number of experiments has to be conducted in order to reduce the error to a reasonable level. This is particularly difficult when small probabilities are involved, making empirical evaluation impractical in such cases.

An ideal scenario, from the point of view of producing interesting results, would be to use theory to make as much progress as possible, potentially obtaining uninterpretable formulae, and then to use visualization to understand and find the consequences of such formulae. This would avoid both the limitation of theory to nice formulae and the limitation of empirical studies, namely having to perform large numbers of experiments. The role of the theory would be to significantly reduce the amount of computation required.

The role of visualization would be to make the potentially complicated theoretical formulae understandable. This is precisely what I propose: a new hybrid method to characterize and understand models and model selection measures (i.e., methods that evaluate learning models). The work I present here is an initial foray into what might prove to be a useful tool for studying learning algorithms. I call this method semi-analytical, since not just the formulae, but visualization in conjunction with the formulae, leads to interpretability. What makes such an endeavor possible is the fact that, mostly due to the linearity of expectation, moments of complicated random variables can be computed exactly with efficient formulae, even though deriving the exact distribution in the form of small closed-form formulae is a daunting task.

1.1 Practical Impact

In this section I discuss the impact of the proposed research on industry and on the field of machine learning and data mining in general.

Impact on industry and other fields: In today's day and age, adaptive classification models find applicability in a wide spectrum of applications ranging over various domains. Financial firms deploy these models for security purposes such as fraud detection and intrusion detection. Credit card companies use these models to make credit card offers to people by categorizing them based on their previous transaction history. Giant supermarket chains use these models to figure out which groups of items are generally bought together by customers. These models are used extensively in bioinformatics for problems such as gene classification based on functionality, DNA/protein sequence matching, etc. They also find application in medicine, for the analysis of the importance of clinical parameters and of their combinations, the prediction of disease progression, the extraction of medical knowledge for outcome research, therapy planning and support, and overall patient management. Today's state-of-the-art search engines also use classification models. This is just a snapshot of the entire range of applications they are used for.

Noticing the wide applicability of classification models and the sheer extent of their number, choosing the correct model for a specific application is a highly desirable goal. Through this research, I hope to take a step forward in this direction.

Impact on machine learning and data mining research: I believe that this research will assist in providing new insight into the behavior of classification models and model selection measures. The framework may be used as an exploratory tool for observing and understanding models and selection measures under the specific circumstances that interest the user. Other related problems may also be framed in an analogous fashion, leading to interesting observations and consequent interpretations.

1.2 Related Work

A critical piece of theoretical work that is coherent and provides structure for comparing learning methods is Statistical Learning Theory (SLT) (Vapnik [1998]). SLT categorizes classification algorithms (actually the more general learning algorithms) into different classes called concept classes. The concept class of a classification algorithm is determined by its Vapnik-Chervonenkis (VC) dimension, which is related to the shattering capability of the algorithm. Given a 2-class problem, the shattering capability of a function refers to the maximum number of points that the function can classify without making any errors, for all possible assignments of class labels to the points in some chosen configuration. The shattering capability of an algorithm is the supremum of the shattering capabilities of all the functions it can represent. SLT derives distribution-free bounds on the generalization error (the expected error over the entire input) of a classifier built using a particular classification algorithm belonging to a concept class. The bounds are functions of the VC dimension and the sample size. The strength of this technique is that, by finding the VC dimension of an algorithm, I can derive error bounds for the classifiers built using this algorithm without ever referring to the underlying distribution.
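For concreteness, one commonly quoted form of such a bound (the exact constants vary across presentations, so this should be read as an illustrative instance rather than the precise form used later in this thesis) is: with probability at least $1 - \delta$ over an i.i.d. sample of size $N$, every classifier $f$ in a concept class of VC dimension $d$ satisfies

$$GE(f) \;\leq\; \widehat{err}_N(f) \;+\; \sqrt{\frac{d\left(\ln\frac{2N}{d} + 1\right) + \ln\frac{4}{\delta}}{N}}$$

where $\widehat{err}_N(f)$ is the training error. Nothing on the right-hand side refers to the underlying distribution, which is both the strength of the technique and, as discussed next, the source of its looseness.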

A fallout of this very general characterization is that the bounds are usually loose (Boucheron et al. [2005], Williamson [2001]), which in turn makes statements about any particular classifier weak.

There is a large body of both experimental and theoretical work that addresses the problem of understanding various model selection measures. The model selection measures relevant to our discussion are hold-out-set validation and cross-validation. Shao [1993] showed that asymptotically leave-one-out (LOO) chooses the best but not the simplest model. Devroye et al. [1996] derived distribution-free bounds for cross validation; the bounds they found were for the nearest neighbor model. Breiman [1996] showed that cross validation gives an unbiased estimate of the first moment of the generalization error. Though cross validation has desirable characteristics when estimating the first moment, Breiman stated that its variance can be significant. Theoretical bounds on the LOO error under certain algorithmic stability assumptions were given by Kearns and Ron [1997], who showed that the worst-case error of the LOO estimate is not much worse than the training error estimate. Elisseeff and Pontil [2003] introduced the notion of training stability and showed that even with this weaker notion of stability, good bounds could be obtained on the generalization error. Blum et al. [1999] showed that v-fold cross validation is, in expectation, at least as good as hold-out-set estimation on a hold-out set of size N/v. Kohavi [1995] conducted experiments on Naive Bayes and C4.5 using cross-validation; through his experiments he concluded that 10-fold stratified cross validation should be used for model selection. Moore and Lee [1994] proposed heuristics to speed up cross-validation. Plutowski's [1996] survey included proposals with theoretical results, heuristics and experiments on cross-validation; it was especially geared towards the behavior of cross-validation on neural networks, and he inferred from the previously published results that cross-validation is robust. More recently, Bengio and Grandvalet [2003] proved that there is no universally unbiased estimator of the variance of cross-validation. Zhu and Rohwer [1996] proposed a simple setting in which cross-validation performs poorly.

Goutte [1997] refuted this proposed setting and claimed that a realistic scenario in which cross-validation fails is still an open question.

The work I present here covers the middle ground between these theoretical and empirical results by allowing classifier-specific results based on moment analysis. Such an endeavor is important since the gap between theoretical and empirical results is significant (Langford [2005]). Preliminary work of this nature was done in Braga-Neto and Dougherty [2005], where the authors characterized the discrete histogram rule. However, their analysis does not provide any indication of how other, more popular algorithms can be characterized in a similar fashion while keeping scalability and accuracy in mind. Specific classification schemes such as the W-statistic (Anderson [2003]) have been characterized in the past, but such analysis is very much limited to those and other similar statistics. The methodology I present here may potentially be applicable to a large variety of learning algorithms.

1.3 Methodology

What is the Methodology?

The methodology for studying classification models consists in studying the behavior of the first two central moments of the GE of the classification algorithm under study. The moments are taken over the space of all possible classifiers produced by the classification algorithm when it is trained over all possible datasets sampled independently and identically (i.i.d.) from some distribution. The first two moments give enough information about the statistical behavior of the classification algorithm to allow interesting observations about its behavior and trends. Higher moments may be computed using the same strategy, but might prove inefficient to compute.
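Stated symbolically, in the notation developed in Chapter 2, with $Z(N)$ denoting the space of classifiers obtained by training on i.i.d. datasets of size $N$, the two quantities the methodology tracks are

$$E_{Z(N)}[GE(\zeta)] \qquad\text{and}\qquad Var_{Z(N)}[GE(\zeta)] = E_{Z(N)}\!\left[GE(\zeta)^2\right] - \left(E_{Z(N)}[GE(\zeta)]\right)^2 .$$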

Why have such a Methodology?

The answers to the following questions shed light on why the methodology is necessary if a tight statistical characterization is to be provided for classification algorithms.

1. Why study GE? The biggest danger of learning is overfitting the training data. The main idea in using GE as the measure of success of learning, instead of the empirical error on a given dataset, is to provide a mechanism to avoid this pitfall. Implicitly, by analyzing GE, all of the input is considered.

2. Why study the moments instead of the distribution of GE? Ideally, I would study the distribution of GE instead of the moments in order to get a complete picture of its behavior. Studying the distribution of discrete random variables, except for very simple cases, turns out to be very hard. The difficulty comes from the fact that even computing the pdf at a single point is intractable, since all combinations of random choices that result in the same value of GE have to be enumerated. On the other hand, the first two central moments, coupled with distribution-independent bounds such as Chebyshev and Chernoff, give guarantees about the worst possible behavior that are not too far from the actual behavior (a small constant factor); the Chebyshev guarantee is sketched after this list. Interestingly, it is possible to compute the moments of a random variable like GE without ever explicitly writing or making use of the formula for the pmf/pdf. What makes such an endeavor possible is extensive use of the linearity of expectation, as is explained later.

3. Why characterize a class of classifiers instead of a single classifier? While the use of GE as the success measure is standard practice in machine learning, characterizing classes of classifiers instead of the particular classifier produced on a given dataset is not. From the point of view of the analysis, without large testing datasets it is not possible to evaluate GE directly for a particular classifier. By considering a class of classifiers to which a classifier belongs, an indirect characterization is obtained for the particular classifier. This is precisely what Statistical Learning Theory (SLT) does; there, the class of classifiers consists of all classifiers with the same VC dimension. The main problem with SLT results is that classes based on VC dimension are too large, and thus the results tend to be pessimistic. In the methodology, the class of classifiers consists only of the classifiers that are produced by the given classification algorithm from datasets of a fixed size sampled from the underlying distribution. This is the smallest probabilistic class in which the particular classifier produced on a given dataset can be placed.
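To make the guarantee invoked in question 2 concrete, here is the Chebyshev form of the bound (a standard inequality, restated here in the notation of this thesis): for any $t > 0$,

$$P_{Z(N)}\Big[\,\big|GE(\zeta) - E_{Z(N)}[GE(\zeta)]\big| \geq t\,\Big] \;\leq\; \frac{Var_{Z(N)}[GE(\zeta)]}{t^2} .$$

Once the first two central moments are computed exactly, this immediately bounds the probability that the generalization error of a classifier drawn from $Z(N)$ strays far from its mean.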

How do I Implement the Methodology?

One way of approximately estimating the moments of GE over all possible classifiers for a particular classification algorithm is by directly using Monte Carlo. If I use Monte Carlo directly, I first need to produce a classifier on a sampled dataset and then test it on a number of test sets sampled from the same distribution, acquiring an estimate of the GE of this classifier. Repeating this entire procedure a number of times, I acquire estimates of GE for different classifiers. Then, by averaging the errors of these multiple classifiers, I get an estimate of the first moment of GE. The variance of GE can be estimated similarly.

Another way of estimating the moments of GE is by obtaining parametric expressions for them. If this can be accomplished, the moments can be computed exactly. Moreover, by carefully observing the manner in which the expressions are derived for a particular classification algorithm, insights can be gained into analyzing other algorithms of interest. Though deriving the expressions may be a tedious task, using them I obtain highly accurate estimates of the moments. I propose this second alternative for analyzing models. The key to the analysis is focusing on the learning and inference phases of the algorithm. In cases where the parametric expressions are computationally intensive to compute directly, I show that by approximating individual terms using optimization techniques, and even Monte Carlo, I obtain accurate estimates of the moments when compared to directly using Monte Carlo (the first alternative) at the same computational cost. If the moments are to be studied on synthetic data, the distribution is assumed anyway and the parametric expressions can be used directly. If I have real data, an empirical distribution can be built on the dataset and the parametric expressions can then be used.
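The direct Monte Carlo alternative described above is easy to state in code. The sketch below is a minimal illustration of that baseline, not the method proposed in this thesis; `train` and the dictionary representation of the joint distribution are hypothetical stand-ins for whatever classification algorithm and distribution are under study.

```python
import random

def mc_moments_of_ge(joint, train, N, n_classifiers=200, n_test=5000):
    """Estimate E[GE] and Var(GE) by direct Monte Carlo.

    joint: dict mapping (x, y) pairs to probabilities (must sum to 1).
    train: function taking a list of (x, y) pairs and returning a
           classifier (a function from x to a predicted label).
    """
    pairs = list(joint.keys())
    weights = [joint[p] for p in pairs]

    def sample_dataset(size):
        return random.choices(pairs, weights=weights, k=size)

    ge_samples = []
    for _ in range(n_classifiers):
        clf = train(sample_dataset(N))       # one classifier per dataset
        test = sample_dataset(n_test)        # fresh i.i.d. test sample
        err = sum(clf(x) != y for x, y in test) / n_test
        ge_samples.append(err)               # noisy estimate of GE(clf)

    mean = sum(ge_samples) / len(ge_samples)
    var = sum((g - mean) ** 2 for g in ge_samples) / len(ge_samples)
    return mean, var
```

Every GE value here is estimated from a finite test set, and the moments from a finite number of classifiers; this double layer of sampling error is exactly what the parametric expressions avoid.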

1.4 Applying the Methodology

It is important to note that the methodology is not aimed at providing a way of estimating bounds on the GE of a classifier on a given dataset. The primary goal is creating an avenue in which learning algorithms can be studied precisely, i.e., studying the statistical behavior of a particular algorithm w.r.t. a chosen/built distribution. Below, I discuss the two most important perspectives from which the methodology can be applied.

Algorithmic Perspective

If a researcher/practitioner designs a new classification algorithm, he/she needs to validate it. Standard practice is to validate the algorithm on a relatively small number of datasets (5-20) and to report the performance. By observing the behavior of only a few instances of the algorithm, the designer infers its quality. Moreover, if the algorithm underperforms on some datasets, it can sometimes be difficult to pinpoint the precise reason for its failure. If instead he/she is able to derive parametric expressions for the moments of GE, the test results are more relevant to the particular classification algorithm, since the moments are over all possible datasets of a particular size drawn i.i.d. from some chosen/built distribution. Testing individually on all these datasets is an impossible task. Thus, by computing the moments using the parametric expressions, the algorithm is in effect tested on a plethora of datasets, with the results being highly accurate. Moreover, since the testing is done in a controlled environment, i.e., all the parameters are known to the designer during testing, he/she can precisely pinpoint the conditions under which the algorithm performs well and the conditions under which it underperforms.

Dataset Perspective

If an algorithm designer validates his/her algorithm by computing moments as described above, it can instill greater confidence in a practitioner searching for an appropriate algorithm for his/her dataset. The reason is that if the practitioner has a dataset that has a similar structure, or is from a similar source, as the test dataset on which an empirical distribution was built and favorable results were reported by the designer, then the results apply not only to that particular test dataset but to other datasets of a similar type; since the practitioner's dataset belongs to this collection, the results also apply to it. Note that a distribution is just a weighting of different datasets, and this perspective is used in the above exposition.

If the dataset is categorical, it can be precisely modeled by a multinomial distribution in the following manner. A multinomial is completely characterized by the probabilities in each of its cells (which sum to 1) and the total count N (the sum of the individual cell counts).

The designer can set the number of cells in the multinomial to the number of cells in his contingency table, with the empirical estimate for each cell probability being the corresponding cell count divided by the size of the dataset, which is the value of N. With this, I have a fully specified multinomial distribution, using which I can compute the formulations and consequently characterize the moments of the GE. Since the estimates for the cell probabilities are based on the available dataset, the true underlying distribution, of which this dataset is a sample, may have slightly different values. This scenario can be accounted for by varying the cell probabilities to a desired degree and observing the variation in the estimates of GE; a small sketch of this construction is given after this paragraph. This assists in deciphering the sensitivity of the model in question to noise. In the continuous case there is no such generic distribution (as the multinomial), but a popular choice is a mixture of Gaussians (other distributions could also be used).
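A minimal sketch of this construction, assuming the dataset is a list of categorical input-output pairs; the function names are illustrative, not part of any library.

```python
from collections import Counter

def empirical_multinomial(dataset):
    """Build the fully specified multinomial described above.

    dataset: list of (x, y) pairs, with x a tuple of categorical
             attribute values and y a class label.
    Returns (cell_probs, N): cell probabilities summing to 1 and the
    total count N.
    """
    counts = Counter(dataset)          # one cell per distinct (x, y)
    N = len(dataset)
    cell_probs = {cell: c / N for cell, c in counts.items()}
    return cell_probs, N

def perturb(cell_probs, eps, rng):
    """Vary the cell probabilities slightly (sensitivity analysis)."""
    noisy = {cell: max(p + rng.uniform(-eps, eps), 0.0)
             for cell, p in cell_probs.items()}
    total = sum(noisy.values())
    return {cell: p / total for cell, p in noisy.items()}  # renormalize
```

Feeding perturbed versions of `cell_probs` back into the moment expressions then shows how sensitive the computed moments of GE are to noise in the estimated distribution, as suggested above.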

1.5 Research Goals

In this section I state the specific research goals that I have accomplished in this thesis work.

General Framework: To provide a statistical characterization of classifiers, a probabilistic class of classifiers that contains the desired classifier has to be considered, since the behavior of any particular classifier can be arbitrarily poor. The class considered by statistical learning theory is the class of classifiers with a given VC dimension (Vapnik [1998]). While the results thus obtained are very general, no particularity of the classification algorithm is exploited. The class of classifiers considered in this thesis consists of the classifiers obtained by applying the classification algorithm to a dataset of given size sampled i.i.d. from the underlying distribution. This leads to a different way of characterizing classifiers, based on moment analysis. I develop a framework to analyze classification algorithms.

Analysis of Model Selection Measures: To relate the moments of the generalization error (GE) to the moments of the cross-validation error (CE), leave-one-out error (LE) and hold-out-set error (HE). This will assist us in studying the behavior of these errors given the moments of any one of them.

Analysis of Specific Classification Models: To develop customized formulations of the moments for specific classification algorithms. This will aid in studying classification algorithms in conjunction with the selection measures. I choose the following models, which are a mix of parametric and non-parametric models.

1. Naive Bayes Classifier (NBC) model: NBC is a model that is extensively used in industry, due to its robustness outperforming its more sophisticated counterparts in many real-world applications (e.g., spam filtering in Mozilla Thunderbird and Microsoft Outlook, bioinformatics, etc.). There has been work on the robustness of NBC (Domingos and Pazzani [1997], Rish [2001]), but the proposed framework and the inter-relationships between the moments of the various errors help us to extensively study not just the model but also the behavior of the validation methods in conjunction with it.

2. Decision Tree (DT) model: Decision trees are also extensively used in data mining and machine learning applications. Besides performance, they are sometimes preferred over other models (e.g., support vector machines, neural nets) because the process by which the eventual classifier is built from the sample is transparent. The probabilistic formulations will incorporate various pruning conditions such as purity, scarcity and fixed height. The formulations will help in better understanding the behavior of these trees for classification.

3. K-Nearest-Neighbor (KNN) classifier model: This model is one of the simpler models, yet it is highly effective. Theoretical results exist (Stone [1977]) regarding the convergence of the generalization error (GE) of this algorithm to the Bayes error (the best possible performance). However, this result is asymptotic, and for finite sample sizes in real scenarios, finding the optimal value of K is more of an art than a science. The proposed methodology can be used to study the algorithm accurately for different values of K and for different distance metrics in controlled settings.

Scalability: To make the computation of the moments scalable. This is especially relevant when the domain is discrete and the computation of individual probabilities becomes computationally intensive. In these cases I have to come up with approximation techniques that are accurate and fast, making the analysis practical.

Practical Study of the Non-asymptotic Behavior of NBC, DT, KNN and Selection Measures: The formulas for the moments of GE, and consequently of HE, CE and LE, for the NBC, DT and KNN that are derived using the general framework can be used to carry out an extensive study of the behavior of these classification algorithms in conjunction with the model selection measures.

I have carried out such a comparison with the aim of identifying interesting trends about the mentioned classification algorithms and the model selection measures, so as to exemplify the utility of the theoretical framework.

CHAPTER 2
GENERAL FRAMEWORK

Probability distributions completely characterize the behavior of a random variable. Moments of a random variable give us information about its probability distribution. Thus, if I have knowledge of the moments of a random variable, I can make statements about its behavior. In some cases, characterizing a finite subset of moments may prove to be a more desirable alternative than characterizing the entire distribution, which can be wild and computationally expensive to compute. This is precisely what I do when I study the behavior of the generalization error of a classifier and of the error estimation methods, viz. hold-out error, leave-one-out error and cross-validation error. Characterizing the distribution, though possible, can turn out to be a tedious task, and studying the moments instead is a more viable option.

As a result, I employ moment analysis and use linearity of expectation to explore the relationship between various estimates for the error of classifiers: generalization error (GE), hold-out-set error (HE) and cross-validation error (CE); leave-one-out error is just a particular case of CE, and I do not analyze it independently. The relationships are drawn by going over the space of all possible datasets. The actual computation of moments, though, is conducted by going over the space of classifiers induced by a particular classification algorithm and i.i.d. data, since this leads to computational efficiency. I interchangeably go over the space of datasets and the space of classifiers as deemed appropriate, since the classification algorithm is assumed to be deterministic. That is, I have

$$E_{D(N)}[F(\zeta[D(N)])] = E_{Z(N)}[F(\zeta)] = E_{x_1 \ldots x_N}[F(\zeta(x_1, x_2, \ldots, x_N))]$$

where F() is some function that operates on a classifier. I also consider the learning algorithms to be symmetric (the algorithm is oblivious to random permutations of the samples in the training dataset). Throughout this section and in the rest of the thesis I use the notation in Table 2-1 unless stated otherwise.

2.1 Generalization Error (GE)

The notion of generalization error is defined with respect to an underlying probability distribution over the input-output space and a loss function (error metric). I model this probability space with the random vector X for the input and the random variable Y for the output. When the input is fixed, Y(x) is the random variable that models the output. (By modeling the output for a given input as a random variable, I allow the output to be randomized, as it might be in most real circumstances.) I assume in this thesis that the domain $\mathcal{X}$ of X is discrete; all the theory can be extended to the continuous case essentially by replacing the counting measure with the Lebesgue measure and sums with integrals. Whenever the probability and expectation are with respect to the probabilistic space (X, Y) that models the problem, I will not use any index. For other probabilistic spaces, I will specify by an index which probability space I refer to. I denote the error metric by λ(a, b); in this thesis I use only the 0-1 metric, which takes value 1 if $a \neq b$ and 0 otherwise. With this, the generalization error of a classifier ζ is:

$$GE(\zeta) = E[\lambda(\zeta(X), Y)] = P[\zeta(X) \neq Y] = \sum_{x \in \mathcal{X}} P[X = x] \; P[\zeta(x) \neq Y(x)] \qquad (2\text{-}1)$$

where I used the fact that, for the 0-1 loss function, the expectation is the probability that the prediction is erroneous. Notice that the notation using Y(x) is really a conditional on X = x. I use this notation since it is intuitive and more compact. The last form of the generalization error is the most useful in this thesis, since it decomposes a global measure defined over the entire space into micro measures, one for each input.
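The decomposition in Equation 2-1 translates directly into code. The following is a small sketch for a discrete space, assuming the distribution is available as dictionaries; the names are illustrative.

```python
def generalization_error(clf, p_x, p_y_given_x):
    """Exact GE of a fixed classifier under a known discrete distribution.

    clf: function mapping an input x to a predicted label.
    p_x: dict mapping each input x to P[X = x].
    p_y_given_x: dict mapping x to a dict {y: P[Y(x) = y]}.
    """
    ge = 0.0
    for x, px in p_x.items():
        pred = clf(x)
        # P[zeta(x) != Y(x)] = 1 - P[Y(x) = pred]
        ge += px * (1.0 - p_y_given_x[x].get(pred, 0.0))
    return ge
```

The loop mirrors the decomposition exactly: one term per input x, weighted by P[X = x].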

By carefully selecting the class of classifiers for which the moment analysis of the generalization error is performed, meaningful and relevant probabilistic statements can be made about the generalization error of a particular classifier from this class. The probability distribution over the classifiers is based on the randomness of the data used to produce the classifier. To formalize this, let Z(N) be the class of classifiers built over a dataset of size N, with a probability space defined over it. With this, the k-th moment around 0 of the generalization error is:

$$E_{Z(N)}\left[GE(\zeta)^k\right] = \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta] \; GE(\zeta)^k$$

The problem with this definition is that it talks about a global characterization of classifiers, which can be hard to capture. I rewrite the formulae for the first and second moments in terms of the fine-granularity structure of the classifiers. While deriving these moments, I have to consider double expectations of the form $E_{Z(N)}[E[F(x, \zeta)]]$, with F(x, ζ) a function that depends both on the input x and on the classifier. With this I arrive at the following result:

$$E_{Z(N)}[E[F(x, \zeta)]] = \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta] \sum_{x \in \mathcal{X}} P[X = x] \; F(x, \zeta) = \sum_{x \in \mathcal{X}} P[X = x] \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta] \; F(x, \zeta) = E\left[E_{Z(N)}[F(x, \zeta)]\right] \qquad (2\text{-}2)$$

which uses the fact that P[X = x] does not depend on a particular ζ and $P_{Z(N)}[\zeta]$ does not depend on a particular x, even though both quantities depend on the underlying probability distribution. Using the definition of the moments above, Equation 2-1 and Equation 2-2, I have the following theorem.

Theorem 1. The first and second moments of GE are given by

$$E_{Z(N)}[GE(\zeta)] = \sum_{x \in \mathcal{X}} P[X = x] \sum_{y \in \mathcal{Y}} P_{Z(N)}[\zeta(x) = y] \; P[Y(x) \neq y]$$

and

$$E_{Z(N) \times Z(N)}[GE(\zeta)\,GE(\zeta')] = \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} P[X = x] \; P[X = x'] \sum_{y \in \mathcal{Y}} \sum_{y' \in \mathcal{Y}} P_{Z(N) \times Z(N)}[\zeta(x) = y \wedge \zeta'(x') = y'] \; P[Y(x) \neq y] \; P[Y(x') \neq y']$$

Proof. For the first moment,

$$\begin{aligned}
E_{Z(N)}[GE(\zeta)] &= E_{Z(N)}[E[\lambda(\zeta(X), Y)]] = E\left[E_{Z(N)}[\lambda(\zeta(X), Y)]\right] \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{\zeta \in Z(N)} P_{Z(N)}[\zeta] \; P[\zeta(x) \neq Y(x) \mid \zeta] \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{\zeta \in Z(N)} \sum_{y \in \mathcal{Y}} P_{Z(N)}[\zeta] \; P[\zeta(x) = y, Y(x) \neq y \mid \zeta] \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{y \in \mathcal{Y}} \sum_{\zeta \in Z(N):\, \zeta(x) = y} P_{Z(N)}[\zeta] \; P[Y(x) \neq y] \\
&= \sum_{x \in \mathcal{X}} P[X = x] \sum_{y \in \mathcal{Y}} P_{Z(N)}[\zeta(x) = y] \; P[Y(x) \neq y]
\end{aligned}$$

For the second moment,

$$\begin{aligned}
E_{Z(N) \times Z(N)}[GE(\zeta)\,GE(\zeta')] &= E_{Z(N) \times Z(N)}\left[E[\lambda(\zeta(X), Y)] \; E[\lambda(\zeta'(X), Y)]\right] \\
&= \sum_{(\zeta, \zeta') \in Z(N) \times Z(N)} P_{Z(N) \times Z(N)}[\zeta, \zeta'] \left( \sum_{x \in \mathcal{X}} P[X = x] \; P[\zeta(x) \neq Y(x)] \right) \left( \sum_{x' \in \mathcal{X}} P[X = x'] \; P[\zeta'(x') \neq Y(x')] \right)
\end{aligned}$$

$$\begin{aligned}
&= \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} P[X = x] \; P[X = x'] \sum_{(\zeta, \zeta') \in Z(N) \times Z(N)} P_{Z(N) \times Z(N)}[\zeta, \zeta'] \; P[\zeta(x) \neq Y(x)] \; P[\zeta'(x') \neq Y(x')] \\
&= \sum_{x \in \mathcal{X}} \sum_{x' \in \mathcal{X}} P[X = x] \; P[X = x'] \sum_{y \in \mathcal{Y}} \sum_{y' \in \mathcal{Y}} P_{Z(N) \times Z(N)}[\zeta(x) = y \wedge \zeta'(x') = y'] \; P[Y(x) \neq y] \; P[Y(x') \neq y']
\end{aligned}$$

In both series of equations, I made the transition from a summation over the class of classifiers to a summation over the possible outputs, since the focus changed from the classifier to the prediction of the classifier for a specific input (x is fixed inside the first summation). What this effectively does is allow the computation of the moments using only local information (behavior on particular inputs) rather than global information (behavior on all inputs). This speeds up the process of computing the moments.
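As a sketch of how Theorem 1 is used in practice: given the probabilities $P_{Z(N)}[\zeta(x) = y]$ for each input (however the analysis of the particular algorithm supplies them), the first moment is a double sum. The code below assumes those probabilities are precomputed; nothing about a specific classifier is implied.

```python
def first_moment_of_ge(p_x, p_y_given_x, p_pred):
    """E_{Z(N)}[GE] via Theorem 1.

    p_pred: dict mapping x to a dict {y: P_{Z(N)}[zeta(x) = y]},
            i.e., the probability over training sets of size N that
            the learned classifier predicts y on input x.
    """
    moment = 0.0
    for x, px in p_x.items():
        for y, p_zeta in p_pred[x].items():
            moment += px * p_zeta * (1.0 - p_y_given_x[x].get(y, 0.0))
    return moment
```

The cost is linear in the size of the input-output space times whatever it costs to obtain `p_pred`, which is the complexity claim made for the first moment in the next section.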

2.2 Alternative Methods for Computing the Moments of GE

The method I introduced above for computing the moments of the generalization error is based on decomposing the moment into contributions of individual input-output pairs. With such a decomposition, not only does the analysis become simpler, but the complexity of the required algorithm is reduced. In particular, the complexity of computing the first moment is proportional to the size of the input-output space and to the complexity of estimating probabilities of the form $P_{Z(N)}[\zeta(x) = y]$. The complexity of the second moment is quadratic in the size of the input-output space and proportional to the complexity of estimating $P_{Z(N) \times Z(N)}[\zeta(x) = y \wedge \zeta'(x') = y']$. To see the advantage of this method, I compare it with the two other alternatives for computing the moments: definition-based computation and Monte Carlo simulation.

Definition-based computation uses the definition of expectation. It consists in summing, over all possible datasets, the generalization error of the classifier built from each dataset multiplied by the probability of obtaining that dataset as an i.i.d. sample from the underlying probability distribution. Formally,

$$E_{D(N)}[GE(\zeta)] = \sum_{D \in \mathcal{D}(N)} P[D] \; GE(\zeta[D]) \qquad (2\text{-}3)$$

where $\mathcal{D}(N)$ is the set of all possible datasets of size N. The number of possible datasets is exponential in N, with the base of the exponent proportional to the size of the input-output space (the product of the sizes of the domains of the inputs and outputs). Evaluating the moments in this manner is impractical for all but very small spaces and dataset sizes.

Monte Carlo simulation is a simple way to estimate the moments that consists in performing experiments to produce samples that determine the value of the generalization error. In this case, to estimate $E_{D(N)}[GE(\zeta)]$, datasets of size N have to be generated, one for each sample desired. For each of these datasets a classifier has to be constructed according to the classifier construction algorithm. For each classifier produced, samples from the underlying probability distribution have to be generated in order to estimate the generalization error of that classifier. Especially for second moments, the number of samples required to obtain reasonable accuracy for the moments is large. If a study has to be conducted to determine the influence of various parameters of the data generation model, the overall number of experiments that have to be performed becomes infeasible.

In summary, the advantages of the method I propose for estimating the moments are: (a) the formulations are exact, (b) it needs only the local behavior of the classifier, (c) the time complexity is thus reduced, and (d) it does not depend on some of the probabilities being small. I will use this method to compute the moments of the generalization error for the NBC, DT and KNN algorithms.
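To make the infeasibility of definition-based computation concrete, here is a brute-force sketch of Equation 2-3 for a toy space; `train` is again a hypothetical stand-in, and `generalization_error` is the sketch given after Equation 2-1. Since the learner is assumed symmetric, datasets can be enumerated as multisets, each weighted by the number of orderings that realize it.

```python
from itertools import combinations_with_replacement
from math import prod, factorial

def exact_first_moment_by_definition(joint, train, p_x, p_y_given_x, N):
    """E_{D(N)}[GE] by enumerating every dataset of size N (Eq. 2-3)."""
    pairs = list(joint.keys())
    total = 0.0
    for multiset in combinations_with_replacement(pairs, N):
        counts = {p: multiset.count(p) for p in set(multiset)}
        # probability of one ordered sequence with these counts
        prob = prod(joint[p] ** c for p, c in counts.items())
        # number of ordered sequences realizing this multiset
        arrangements = factorial(N)
        for c in counts.values():
            arrangements //= factorial(c)
        clf = train(list(multiset))   # symmetric learner: order irrelevant
        total += arrangements * prob * generalization_error(
            clf, p_x, p_y_given_x)
    return total
```

The number of multisets enumerated is $\binom{|\mathcal{X} \times \mathcal{Y}| + N - 1}{N}$; even with an input-output space of 10 cells and N = 20, this is roughly $10^7$ multisets, while the decomposition of Theorem 1 avoids the enumeration entirely.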

Table 2-1. Notation used throughout the thesis.

Symbol            Meaning
X                 Random vector modeling the input
$\mathcal{X}$     Domain of the random vector X (input space)
Y                 Random variable modeling the output
Y(x)              Random variable modeling the output for input x
$\mathcal{Y}$     Set of class labels (output space)
D                 Dataset
(x, y)            Data point from dataset D
$D_t$             Training dataset
$D_s$             Testing dataset
$D_i$             The i-th part/fold of D (for cross validation)
N                 Size of dataset
$N_t$             Size of training dataset
$N_s$             Size of testing dataset
v                 Number of folds of cross validation
ζ                 Classifier
ζ[D]              Classifier built from dataset D
GE(ζ)             Generalization error of classifier ζ
HE(ζ)             Hold-out-set error of classifier ζ
CE(ζ)             Cross validation error of classifier ζ
Z(S)              The set of classifiers obtained by applying the classification algorithm to an i.i.d. sample of size S
D(S)              Dataset of size S
$E_{Z(S)}[\cdot]$ Expectation w.r.t. the space of classifiers built on a sample of size S


More information

Classifier Complexity and Support Vector Classifiers

Classifier Complexity and Support Vector Classifiers Classifier Complexity and Support Vector Classifiers Feature 2 6 4 2 0 2 4 6 8 RBF kernel 10 10 8 6 4 2 0 2 4 6 Feature 1 David M.J. Tax Pattern Recognition Laboratory Delft University of Technology D.M.J.Tax@tudelft.nl

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

ECE 5424: Introduction to Machine Learning

ECE 5424: Introduction to Machine Learning ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple

More information

Parametric Techniques

Parametric Techniques Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure

More information

An Empirical Study of Building Compact Ensembles

An Empirical Study of Building Compact Ensembles An Empirical Study of Building Compact Ensembles Huan Liu, Amit Mandvikar, and Jigar Mody Computer Science & Engineering Arizona State University Tempe, AZ 85281 {huan.liu,amitm,jigar.mody}@asu.edu Abstract.

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Machine Learning! in just a few minutes. Jan Peters Gerhard Neumann

Machine Learning! in just a few minutes. Jan Peters Gerhard Neumann Machine Learning! in just a few minutes Jan Peters Gerhard Neumann 1 Purpose of this Lecture Foundations of machine learning tools for robotics We focus on regression methods and general principles Often

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI An Introduction to Statistical Theory of Learning Nakul Verma Janelia, HHMI Towards formalizing learning What does it mean to learn a concept? Gain knowledge or experience of the concept. The basic process

More information

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20.

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20. 10-601 Machine Learning, Midterm Exam: Spring 2008 Please put your name on this cover sheet If you need more room to work out your answer to a question, use the back of the page and clearly mark on the

More information

BAYESIAN CLASSIFICATION OF HIGH DIMENSIONAL DATA WITH GAUSSIAN PROCESS USING DIFFERENT KERNELS

BAYESIAN CLASSIFICATION OF HIGH DIMENSIONAL DATA WITH GAUSSIAN PROCESS USING DIFFERENT KERNELS BAYESIAN CLASSIFICATION OF HIGH DIMENSIONAL DATA WITH GAUSSIAN PROCESS USING DIFFERENT KERNELS Oloyede I. Department of Statistics, University of Ilorin, Ilorin, Nigeria Corresponding Author: Oloyede I.,

More information

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag

Decision Trees. Nicholas Ruozzi University of Texas at Dallas. Based on the slides of Vibhav Gogate and David Sontag Decision Trees Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Supervised Learning Input: labelled training data i.e., data plus desired output Assumption:

More information

Final Exam, Fall 2002

Final Exam, Fall 2002 15-781 Final Exam, Fall 22 1. Write your name and your andrew email address below. Name: Andrew ID: 2. There should be 17 pages in this exam (excluding this cover sheet). 3. If you need more room to work

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University August 30, 2017 Today: Decision trees Overfitting The Big Picture Coming soon Probabilistic learning MLE,

More information

A graph contains a set of nodes (vertices) connected by links (edges or arcs)

A graph contains a set of nodes (vertices) connected by links (edges or arcs) BOLTZMANN MACHINES Generative Models Graphical Models A graph contains a set of nodes (vertices) connected by links (edges or arcs) In a probabilistic graphical model, each node represents a random variable,

More information

IMBALANCED DATA. Phishing. Admin 9/30/13. Assignment 3: - how did it go? - do the experiments help? Assignment 4. Course feedback

IMBALANCED DATA. Phishing. Admin 9/30/13. Assignment 3: - how did it go? - do the experiments help? Assignment 4. Course feedback 9/3/3 Admin Assignment 3: - how did it go? - do the experiments help? Assignment 4 IMBALANCED DATA Course feedback David Kauchak CS 45 Fall 3 Phishing 9/3/3 Setup Imbalanced data. for hour, google collects

More information

Algorithm-Independent Learning Issues

Algorithm-Independent Learning Issues Algorithm-Independent Learning Issues Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2007 c 2007, Selim Aksoy Introduction We have seen many learning

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018 Data Mining CS57300 Purdue University Bruno Ribeiro February 8, 2018 Decision trees Why Trees? interpretable/intuitive, popular in medical applications because they mimic the way a doctor thinks model

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

ML in Practice: CMSC 422 Slides adapted from Prof. CARPUAT and Prof. Roth

ML in Practice: CMSC 422 Slides adapted from Prof. CARPUAT and Prof. Roth ML in Practice: CMSC 422 Slides adapted from Prof. CARPUAT and Prof. Roth N-fold cross validation Instead of a single test-training split: train test Split data into N equal-sized parts Train and test

More information

Structure learning in human causal induction

Structure learning in human causal induction Structure learning in human causal induction Joshua B. Tenenbaum & Thomas L. Griffiths Department of Psychology Stanford University, Stanford, CA 94305 jbt,gruffydd @psych.stanford.edu Abstract We use

More information

VC-dimension for characterizing classifiers

VC-dimension for characterizing classifiers VC-dimension for characterizing classifiers Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to

More information

PAC Generalization Bounds for Co-training

PAC Generalization Bounds for Co-training PAC Generalization Bounds for Co-training Sanjoy Dasgupta AT&T Labs Research dasgupta@research.att.com Michael L. Littman AT&T Labs Research mlittman@research.att.com David McAllester AT&T Labs Research

More information

Bayesian decision theory. Nuno Vasconcelos ECE Department, UCSD

Bayesian decision theory. Nuno Vasconcelos ECE Department, UCSD Bayesian decision theory Nuno Vasconcelos ECE Department, UCSD Notation the notation in DHS is quite sloppy e.g. show that ( error = ( error z ( z dz really not clear what this means we will use the following

More information

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning

SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning SVAN 2016 Mini Course: Stochastic Convex Optimization Methods in Machine Learning Mark Schmidt University of British Columbia, May 2016 www.cs.ubc.ca/~schmidtm/svan16 Some images from this lecture are

More information

Introduction to Machine Learning Midterm Exam

Introduction to Machine Learning Midterm Exam 10-701 Introduction to Machine Learning Midterm Exam Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes, but

More information

VC-dimension for characterizing classifiers

VC-dimension for characterizing classifiers VC-dimension for characterizing classifiers Note to other teachers and users of these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

L11: Pattern recognition principles

L11: Pattern recognition principles L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction

More information

COMS 4771 Introduction to Machine Learning. Nakul Verma

COMS 4771 Introduction to Machine Learning. Nakul Verma COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Ensembles Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne

More information

On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality. Weiqiang Dong

On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality. Weiqiang Dong On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality Weiqiang Dong 1 The goal of the work presented here is to illustrate that classification error responds to error in the target probability estimates

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8. Instructor: Haipeng Luo Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

Introduction to Bayesian Learning. Machine Learning Fall 2018

Introduction to Bayesian Learning. Machine Learning Fall 2018 Introduction to Bayesian Learning Machine Learning Fall 2018 1 What we have seen so far What does it mean to learn? Mistake-driven learning Learning by counting (and bounding) number of mistakes PAC learnability

More information

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction 15-0: Learning vs. Deduction Artificial Intelligence Programming Bayesian Learning Chris Brooks Department of Computer Science University of San Francisco So far, we ve seen two types of reasoning: Deductive

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan

Clustering. CSL465/603 - Fall 2016 Narayanan C Krishnan Clustering CSL465/603 - Fall 2016 Narayanan C Krishnan ckn@iitrpr.ac.in Supervised vs Unsupervised Learning Supervised learning Given x ", y " "%& ', learn a function f: X Y Categorical output classification

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata

Principles of Pattern Recognition. C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata Principles of Pattern Recognition C. A. Murthy Machine Intelligence Unit Indian Statistical Institute Kolkata e-mail: murthy@isical.ac.in Pattern Recognition Measurement Space > Feature Space >Decision

More information

The Bayes classifier

The Bayes classifier The Bayes classifier Consider where is a random vector in is a random variable (depending on ) Let be a classifier with probability of error/risk given by The Bayes classifier (denoted ) is the optimal

More information

Machine Learning (CS 567) Lecture 2

Machine Learning (CS 567) Lecture 2 Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

More information

Machine Learning

Machine Learning Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem

More information

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018

Naïve Bayes Introduction to Machine Learning. Matt Gormley Lecture 18 Oct. 31, 2018 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Naïve Bayes Matt Gormley Lecture 18 Oct. 31, 2018 1 Reminders Homework 6: PAC Learning

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

Introduction to ML. Two examples of Learners: Naïve Bayesian Classifiers Decision Trees

Introduction to ML. Two examples of Learners: Naïve Bayesian Classifiers Decision Trees Introduction to ML Two examples of Learners: Naïve Bayesian Classifiers Decision Trees Why Bayesian learning? Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical

More information

Click Prediction and Preference Ranking of RSS Feeds

Click Prediction and Preference Ranking of RSS Feeds Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS

More information

Lecture 9: Bayesian Learning

Lecture 9: Bayesian Learning Lecture 9: Bayesian Learning Cognitive Systems II - Machine Learning Part II: Special Aspects of Concept Learning Bayes Theorem, MAL / ML hypotheses, Brute-force MAP LEARNING, MDL principle, Bayes Optimal

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Lossless Online Bayesian Bagging

Lossless Online Bayesian Bagging Lossless Online Bayesian Bagging Herbert K. H. Lee ISDS Duke University Box 90251 Durham, NC 27708 herbie@isds.duke.edu Merlise A. Clyde ISDS Duke University Box 90251 Durham, NC 27708 clyde@isds.duke.edu

More information

Bias Correction in Classification Tree Construction ICML 2001

Bias Correction in Classification Tree Construction ICML 2001 Bias Correction in Classification Tree Construction ICML 21 Alin Dobra Johannes Gehrke Department of Computer Science Cornell University December 15, 21 Classification Tree Construction Outlook Temp. Humidity

More information