Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts arxiv: v2 [math.st] 23 Jul 2012
Servane Gey
Laboratoire MAP5 - UMR 8145, Université Paris Descartes, Paris Cedex 06, France - Servane.Gey@parisdescartes.fr
July 24, 2012

Abstract

The Vapnik-Chervonenkis (VC) dimension of the set of half-spaces of $\mathbb{R}^d$ with frontiers parallel to the axes is computed exactly. It is shown that it is much smaller than the intuitive value of $d$. A good approximation based on Stirling's formula shows that it is rather of the order $\log_2 d$. This result may be used, for instance, to evaluate the performance of classifiers or regressors based on dyadic partitioning of $\mathbb{R}^d$. Algorithms using axis-parallel cuts to partition $\mathbb{R}^d$ are often used to reduce the computational time of such estimators when $d$ is large.

Keywords: Vapnik-Chervonenkis dimension, axis-parallel cuts.
MSC 2010 Classification: 62G99 62H99

1 Introduction

The VC dimension of a set of subsets was introduced by Vapnik and Chervonenkis [9, 10] to measure its complexity. The VC dimension of a real-valued function space $\mathcal{F}$ is then the VC dimension of $\{\{x ; f(x) \geq 0\} ; f \in \mathcal{F}\}$. In particular, the VC dimension of sets of classifiers or regressors appears commonly in the statistical learning area when evaluating their performance. For example, Vapnik's theory in the classification framework is now widely known (see [3] for instance): let $(X, Y)$ be a couple of variables taking values in $\mathbb{R}^d \times \{0 ; 1\}$, and let $L$ be a sample of $n$ independent replications of $(X, Y)$. If $\hat f$ is a classifier minimizing the average misclassification rate of $L$ on a set of classifiers having finite VC dimension $V$, then, without further assumption on the distribution $P$ of $(X, Y)$, the performance of $\hat f$ is evaluated as follows:

$$\mathbb{E}_L\left[P\left(\hat f(X) \neq Y\right)\right] \leq C_1\, \mathrm{bias}^2(\hat f) + C_2 \sqrt{\frac{V}{n}}, \qquad (1)$$

where $\mathbb{E}_L$ denotes the expectation with respect to the sample distribution, $\mathrm{bias}(\hat f)$ denotes the bias of the classifier $\hat f$, and $C_1$ and $C_2$ are absolute constants.

Functional estimates defined on partitions of $\mathbb{R}^d$ (such as histograms, piecewise polynomials, or splines) are often used to estimate relationships between two variables $X \in \mathbb{R}^d$ and $Y \in \{0 ; 1\}$ or $Y \in \mathbb{R}$. In many cases, the VC dimension of the set of subsets used to construct the partition appears inside risk bounds when evaluating the performance of such estimators. For example, if the set used is the set of all half-spaces of $\mathbb{R}^d$, its VC dimension $d + 1$ often has to be taken into account. When $d$ is large, it is often computationally easier to construct partitions using axis-parallel cuts. For example, some theoretical developments on dyadic partitions of $\mathbb{R}^2$ are given in [4, 1], and the VC dimension of axis-parallel cuts appears more particularly in the results obtained on the performance of classification and regression binary decision trees (CART), introduced by Breiman et al. [2] in 1984 and theoretically studied in [8, 7, 5, 6].

2 Reminder about VC Dimension

The VC dimension of a set $\mathcal{A}$ of subsets of some measurable space $\mathcal{X}$ is based on counting the number of distinct intersections of the elements of $\mathcal{A}$ with a finite set of fixed points in $\mathcal{X}$.

Definition 1 (Vapnik-Chervonenkis Dimension). Let $\mathcal{A}$ be a set of subsets of some measurable space $\mathcal{X}$. Then $(x_1, \dots, x_n) \in \mathcal{X}^n$ is said to be shattered by $\mathcal{A}$ if all subsets of $\{x_1 ; \dots ; x_n\}$ are covered by $\mathcal{A}$, that is if

$$\left|\left\{ \{x_1, \dots, x_n\} \cap A ; A \in \mathcal{A} \right\}\right| = 2^n.$$

The Vapnik-Chervonenkis dimension $VC(\mathcal{A})$ of $\mathcal{A}$ is then defined as the maximal integer $n$ such that there exist $n$ points in $\mathcal{X}$ shattered by $\mathcal{A}$, i.e.

$$VC(\mathcal{A}) = \max\left\{ n ; \max_{(x_1, \dots, x_n) \in \mathcal{X}^n} \left|\left\{ \{x_1, \dots, x_n\} \cap A ; A \in \mathcal{A} \right\}\right| = 2^n \right\}.$$

If no such $n$ exists, then $VC(\mathcal{A}) = +\infty$.
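The definition above can be checked by brute force on small examples. The following Python sketch is only an illustration under stated assumptions: the helper names `traces`, `is_shattered` and `half_lines` are hypothetical, sets are represented by their indicator functions, and the family is taken to be half-lines $]-\infty ; a]$ of $\mathbb{R}$ restricted to a finite grid of thresholds $a$ so that it can be enumerated.

```python
def traces(points, sets):
    """Distinct intersections {x_1,...,x_n} ∩ A for A in `sets`.

    Each set A is given by its indicator function; an intersection is
    stored as the frozenset of indices of the points it contains."""
    return {frozenset(i for i, x in enumerate(points) if A(x)) for A in sets}


def is_shattered(points, sets):
    """The points are shattered iff all 2^n subsets appear as intersections."""
    return len(traces(points, sets)) == 2 ** len(points)


def half_lines(thresholds):
    """Illustrative family: half-lines ]-inf; a] of R, one per threshold a."""
    return [lambda x, a=a: x <= a for a in thresholds]


# Assumption: thresholds below, between and above the points are enough to
# reach every intersection that half-lines can produce on these points.
family = half_lines([-1.0, 0.5, 1.5, 2.5])

print(is_shattered([1.0], family))       # a single point
print(is_shattered([1.0, 2.0], family))  # two points: {x_2} alone is never cut out
```

With two points $x_1 < x_2$ on the line, no half-line $]-\infty ; a]$ contains $x_2$ without containing $x_1$, so only three of the four subsets appear and the pair is not shattered.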
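Thus, it is easily seen that the larger $VC(\mathcal{A})$, the more complex $\mathcal{A}$. For example, if $\mathcal{A} = \{\,]-\infty ; x] ; x \in \mathbb{R}\}$, then $VC(\mathcal{A}) = 1$; or if $\mathcal{A}$ is the set of all half-spaces in $\mathbb{R}^d$, then $VC(\mathcal{A}) = d + 1$. Since the set of axis-parallel cuts is a subset of the set of all half-spaces in $\mathbb{R}^d$, it could be natural to think that its VC dimension is of order $d$. Actually, it is shown in what follows that it is of order $\log_2 d$.

3 VC Dimension of axis-parallel cuts

We give a formula to compute the VC dimension of axis-parallel cuts in $\mathbb{R}^d$. Since the obtained formula is not always easy to handle, an approximation is also given.

Lemma 1. Let

$$\mathcal{A}_d = \left\{ \{x \in \mathbb{R}^d ; x^i \leq a\} ; i = 1, \dots, d,\ a \in \mathbb{R} \right\}.$$

Then

$$VC(\mathcal{A}_d) = \max\left\{ n ; \binom{n}{\lfloor n/2 \rfloor} \leq d \right\},$$

where $\lfloor x \rfloor$ denotes the integer part of $x$. Furthermore, the following approximation of $VC(\mathcal{A}_d)$ is available for all $d \geq 2$:

log d log V C(A d) log ( ) d d log 2

Figure 1 shows that $VC(\mathcal{A}_d)$ is a piecewise constant function of the space dimension $d$, which increases at a rate much smaller than the intuitive value of $d$. It also shows that the bounds computed from Stirling's formula are sharp.

Figure 1: $VC(\mathcal{A}_d)$ with respect to the space dimension $d$ and Stirling's bounds.

Proof. The idea is that, to have $n$ points $(x_1, \dots, x_n)$ shattered by $\mathcal{A}_d$, all the subsets of $\{x_1, \dots, x_n\}$ should be covered by $\mathcal{A}_d$. But, if there exists $p \leq n$ such that there are at least $d + 1$ subsets of $\{x_1, \dots, x_n\}$ having $p$ elements, then $\mathcal{A}_d$ will miss at least $\binom{n}{p} - d$ subsets.

Let $n \geq 1$ and let $(x_1, \dots, x_n)$ be $n$ points in $\mathbb{R}^d$. Suppose that $n$ is such that $\binom{n}{\lfloor n/2 \rfloor} > d$. This means that there are at least $d + 1$ subsets of $\{x_1, \dots, x_n\}$ of size $\lfloor n/2 \rfloor$. For each coordinate $i = 1, \dots, d$, let us denote by $x^i_{i(\cdot)}$ the order statistic computed from the $i$-th coordinate of $(x_1, \dots, x_n)$, that is, for all $i = 1, \dots, d$,

$$x^i_{i(1)} \leq x^i_{i(2)} \leq \dots \leq x^i_{i(n)}.$$

Let $p = \lfloor n/2 \rfloor$ and let

$$\mathcal{B}_p = \left\{ \{x_{i(1)} ; \dots ; x_{i(p)}\} ; i = 1, \dots, d \text{ and } |\{x_{i(1)} ; \dots ; x_{i(p)}\}| = p \right\},$$

$$\mathcal{B}_p^c = \left\{ B \subset \{x_1, \dots, x_n\} ; |B| = p \text{ and } B \notin \mathcal{B}_p \right\}.$$

Hence $\mathcal{B}_p$ is covered by $\mathcal{A}_d$ (by simply taking $A = \{x^i \leq (x^i_{i(p)} + x^i_{i(p+1)})/2\}$ for each coordinate), and we have that

$$|\mathcal{B}_p| \leq d \quad \text{and} \quad |\mathcal{B}_p^c| \geq \binom{n}{p} - d > 0.$$

Let $B \in \mathcal{B}_p^c$ and $A = \{x^i \leq a\} \in \mathcal{A}_d$. If $|\{x_1, \dots, x_n\} \cap A| \neq p$, then $\{x_1, \dots, x_n\} \cap A \neq B$. Else, since $\{x_1, \dots, x_n\} \cap A = \{x_j ; x^i_j \leq a\}$, we have that $x^i_{i(j)} \leq a$ for all $j = 1, \dots, p$, and $x^i_{i(j)} > a$ for all $j = p + 1, \dots, n$. So $\{x_1, \dots, x_n\} \cap A = \{x_{i(1)} ; \dots ; x_{i(p)}\}$ and $|\{x_{i(1)} ; \dots ; x_{i(p)}\}| = p$, leading to $\{x_1, \dots, x_n\} \cap A \in \mathcal{B}_p$, and then to $\{x_1, \dots, x_n\} \cap A \neq B$. So, for all $B \in \mathcal{B}_p^c$ and all $A \in \mathcal{A}_d$, $\{x_1, \dots, x_n\} \cap A \neq B$. So, if $\binom{n}{\lfloor n/2 \rfloor} > d$, $(x_1, \dots, x_n)$ cannot be shattered by $\mathcal{A}_d$. Thus

$$VC(\mathcal{A}_d) \leq \max\left\{ n ; \binom{n}{\lfloor n/2 \rfloor} \leq d \right\}.$$

Let $n$ be such that $\binom{n}{\lfloor n/2 \rfloor} \leq d$. Let $(x_1, \dots, x_n)$ be $n$ points of $\mathbb{R}^d$ defined as follows: for each coordinate $i = 1, \dots, \binom{n}{\lfloor n/2 \rfloor}$, let $\{i_1 ; \dots ; i_{\lfloor n/2 \rfloor}\}$ be the $i$-th subset of $\lfloor n/2 \rfloor$ indices in $\{1 ; \dots ; n\}$, where the indices are denoted in ascending order, i.e. $i_1 < \dots < i_{\lfloor n/2 \rfloor} \leq n$. Since $\binom{n}{\lfloor n/2 \rfloor} \leq d$, we obtain $\binom{n}{\lfloor n/2 \rfloor}$ distinct subsets of indices. Hence we take for each such coordinate $x^i_{i_k} = k$. Then the remaining values of $(x_1, \dots, x_n)$ are taken as follows. Since $\binom{n}{\lfloor n/2 \rfloor + 1} \leq d$, for each subset $\{i'_1 ; \dots ; i'_{\lfloor n/2 \rfloor + 1}\}$ of $\{1 ; \dots ; n\}$ with $\lfloor n/2 \rfloor + 1$ elements, there exists $i \in \{1 ; \dots ; \binom{n}{\lfloor n/2 \rfloor}\}$ such that $\{i'_1 ; \dots ; i'_{\lfloor n/2 \rfloor}\} = \{i_1 ; \dots ; i_{\lfloor n/2 \rfloor}\}$. Then take $x^i_{i'_{\lfloor n/2 \rfloor + 1}} = \lfloor n/2 \rfloor + 1$. Let us note that, if $n$ is odd, there is a bijection between $i$ and $i'$. Let $\{j_1 ; \dots ; j_m\} = \{j \notin \{i_1 ; \dots ; i_{\lfloor n/2 \rfloor + 1}\}\}$, with $j_1 < \dots < j_m$, and let $j_0 = i_{\lfloor n/2 \rfloor + 1}$. Then take $x^i_{j_k} = x^i_{j_{k-1}} + 1$. If not filled, the last coordinates are set to be equal to $n$. Hence, we obtain that, for all $j \notin \{i_1 ; \dots ; i_{\lfloor n/2 \rfloor}\}$, $x^i_j \geq \lfloor n/2 \rfloor + 1$.

Then $(x_1, \dots, x_n)$ is shattered by $\mathcal{A}_d$: for $p \in \{0 ; \dots ; n\}$, let $B = \{x_{i_1} ; \dots ; x_{i_p}\} \subset \{x_1, \dots, x_n\}$, with $1 \leq i_1 < i_2 < \dots < i_p \leq n$ as soon as $p \neq 0$.

If $p = 0$, let $i_0 = \operatorname{argmin}_{1 \leq i \leq d} \min_j x^i_j$, and take $A = \{x^{i_0} \leq \min_j x^{i_0}_j - 1\}$. Then $B = \{x_1, \dots, x_n\} \cap A = \emptyset$.

If $p = n$, let $i_n = \operatorname{argmax}_{1 \leq i \leq d} \max_j x^i_j$, and take $A = \{x^{i_n} \leq \max_j x^{i_n}_j + 1\}$. Then $B = \{x_1, \dots, x_n\} \cap A = \{x_1, \dots, x_n\}$.

If $0 < p \leq \lfloor n/2 \rfloor$, let $A \in \mathcal{A}_d$ be the subset defined by $A = \{x^i \leq p + 1/2\}$, with $i$ the coordinate corresponding to a subset of indices $\{i_1 ; \dots ; i_{\lfloor n/2 \rfloor}\}$ containing $\{i_1 ; \dots ; i_p\}$. Then, by definition of $(x^i_1, \dots, x^i_n)$, $B = \{x_1, \dots, x_n\} \cap A$.

If $\lfloor n/2 \rfloor + 1 \leq p < n$, let $i$ be the coordinate corresponding to the configuration $\{i_1 ; \dots ; i_{\lfloor n/2 \rfloor + 1}\}$ (as defined by $(x_1, \dots, x_n)$). Let $A \in \mathcal{A}_d$ be the subset defined by $A = \{x^i \leq p + 1/2\}$. Then, by definition of $(x^i_1, \dots, x^i_n)$, $B = \{x_1, \dots, x_n\} \cap A$.

Thus

$$VC(\mathcal{A}_d) \geq \max\left\{ n ; \binom{n}{\lfloor n/2 \rfloor} \leq d \right\}.$$

Then, the lower and upper bounds of $VC(\mathcal{A}_d)$ are computed by using Stirling's formula: for all $n \geq 1$ we have

$$\sqrt{2\pi}\, e^{-(n+1)} (n+1)^{n+\frac{1}{2}} \leq n! \leq \sqrt{2\pi}\, e^{-(n+1)} e^{\frac{1}{12(n+1)}} (n+1)^{n+\frac{1}{2}}.$$

A simple calculation gives the following: if $n$ is even, then

$$\binom{n}{n/2} \leq \frac{e^{\frac{1}{12(n+1)}}\, e}{\sqrt{2\pi}}\, \frac{(n+1)^{n+1/2}}{(n+2)^{n+1}}\, 2^{n+1} \leq \frac{e^{1+\frac{1}{36}}}{\sqrt{6\pi}}\, 2^{n+1},$$

and if $n$ is odd, then

$$\binom{n}{\lfloor n/2 \rfloor} \leq \frac{e^{\frac{1}{12(n+1)}}\, e}{\sqrt{2\pi}}\, \frac{(n+1)^{(n+1)/2}}{(n+3)^{n/2+1}}\, 2^{n+1} \leq \frac{e^{1+\frac{1}{24}}}{\sqrt{2\pi}}\, 2^{n}.$$

Thus, as soon as these right-hand sides are at most $d$, we have $\binom{n}{\lfloor n/2 \rfloor} \leq d$. Taking the logarithm leads to the lower bound of $VC(\mathcal{A}_d)$. On the other hand, if $\binom{n}{\lfloor n/2 \rfloor} \leq d$, we have that, if $n$ is even,

$$\binom{n}{n/2} \geq \frac{e\, e^{-\frac{1}{3n+6}}}{\sqrt{2\pi}}\, \frac{(n+1)^{n+1/2}}{(n+2)^{n+1}}\, 2^{n+1} \geq \frac{e^{-\frac{1}{12}}}{\sqrt{2\pi}}\, \frac{2^{n+1}}{\sqrt{d+2}},$$

and if $n$ is odd,

$$\binom{n}{\lfloor n/2 \rfloor} \geq \frac{e\, e^{-\frac{n+2}{3(n+3)(n+1)}}}{\sqrt{2\pi}}\, \frac{(n+1)^{(n+1)/2}}{(n+3)^{n/2+1}}\, 2^{n+1} \geq \frac{e^{-\frac{1}{8}}}{\sqrt{2\pi}}\, \frac{2^{n+1}}{\sqrt{d+3}}.$$

Thus, since, for all $n$ such that $\binom{n}{\lfloor n/2 \rfloor} \leq d$,

$$\frac{e^{-\frac{1}{8}}}{\sqrt{2\pi}}\, \frac{2^{n+1}}{\sqrt{d+3}} \leq d,$$

the upper bound of $VC(\mathcal{A}_d)$ is found by taking the logarithm of this last expression.

References

[1] Akakpo, N. Adaptation to anisotropy and inhomogeneity via dyadic piecewise polynomial selection. Mathematical Methods of Statistics 21, 1 (2012), 1-28.
[2] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification And Regression Trees. Chapman & Hall, 1984.
[3] Devroye, L., Györfi, L., and Lugosi, G. A probabilistic theory of pattern recognition, vol. 31 of Applications of Mathematics (New York). Springer-Verlag, New York, 1996.
[4] Donoho, D. L. CART and best-ortho-basis: A connection. The Annals of Statistics 25, 5 (1997).
[5] Gey, S. Risk bounds for CART classifiers under a margin condition. Pattern Recognition 45 (2012).
[6] Gey, S., and Mary-Huard, T. Risk bounds for embedded variable selection in classification trees. Tech. rep., arXiv, v1, 2011.
[7] Gey, S., and Nedelec, E. Model selection for CART regression trees. IEEE Trans. Inform. Theory 51, 2 (2005).
[8] Nobel, A. B. Analysis of a complexity-based pruning scheme for classification trees. IEEE Trans. Inform. Theory 48, 8 (2002).
[9] Vapnik, V. N., and Chervonenkis, A. Y. Theory of uniform convergence of frequencies of events to their probabilities and problems of search for an optimal solution from empirical data. Avtomat. i Telemeh., 2 (1971).
[10] Vapnik, V. N., and Chervonenkis, A. Y. Teoriya raspoznavaniya obrazov. Statisticheskie problemy obucheniya. Izdat. Nauka, Moscow.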
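As a numerical companion to the Lemma of Section 3, the following Python sketch evaluates the exact formula, the largest $n$ such that the binomial coefficient "$n$ choose $\lfloor n/2 \rfloor$" is at most $d$, and prints it next to $\log_2 d$. It also checks shattering by axis-parallel cuts by brute force on small point sets. The function names `vc_axis_parallel` and `shattered_by_cuts` are hypothetical, and the choice of candidate thresholds (below, between and above the observed coordinate values) is an implementation assumption, not something taken from the paper.

```python
from math import comb, log2


def vc_axis_parallel(d):
    """Largest n such that C(n, floor(n/2)) <= d, for an integer d >= 1."""
    n = 1
    while comb(n + 1, (n + 1) // 2) <= d:
        n += 1
    return n


def shattered_by_cuts(points):
    """Brute-force check that `points` (tuples of R^d) are shattered by the
    cuts {x ; x^i <= a}.  Any other threshold a yields the same intersection
    as one of the candidates tried here."""
    n, d = len(points), len(points[0])
    seen = set()
    for i in range(d):
        values = sorted({x[i] for x in points})
        cuts = [values[0] - 1]
        cuts += [(u + v) / 2 for u, v in zip(values, values[1:])]
        cuts += [values[-1] + 1]
        for a in cuts:
            seen.add(frozenset(j for j, x in enumerate(points) if x[i] <= a))
    return len(seen) == 2 ** n


for d in (1, 2, 3, 6, 10, 20, 100, 1000):
    print(d, vc_axis_parallel(d), round(log2(d), 2))

# Three points of R^3 whose coordinate orderings are cyclic shifts of each other.
print(shattered_by_cuts([(1, 3, 2), (2, 1, 3), (3, 2, 1)]))
# Three points of R^2: C(3, 1) = 3 > 2, so they cannot be shattered.
print(shattered_by_cuts([(1, 3), (2, 2), (3, 1)]))
```

For instance, $d = 6$ gives 4, since $\binom{4}{2} = 6 \leq 6 < \binom{5}{2} = 10$.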
More informationCSCI 5622 Machine Learning
CSCI 5622 Machine Learning DATE READ DUE Mon, Aug 31 1, 2 & 3 Wed, Sept 2 3 & 5 Wed, Sept 9 TBA Prelim Proposal www.rodneynielsen.com/teaching/csci5622f09/ Instructor: Rodney Nielsen Assistant Professor
More informationSupport Vector Machines with a Reject Option
Support Vector Machines with a Reject Option Yves Grandvalet, 2, Alain Rakotomamonjy 3, Joseph Keshet 2 and Stéphane Canu 3 Heudiasyc, UMR CNRS 6599 2 Idiap Research Institute Université de Technologie
More informationEscaping the curse of dimensionality with a tree-based regressor
Escaping the curse of dimensionality with a tree-based regressor Samory Kpotufe UCSD CSE Curse of dimensionality In general: Computational and/or prediction performance deteriorate as the dimension D increases.
More informationSupport Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM
1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan Jean-Philippe.Vert@mines.org Human Genome Center, University
More informationProbabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities ; CU- CS
University of Colorado, Boulder CU Scholar Computer Science Technical Reports Computer Science Spring 5-1-23 Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities
More informationBoosting Methods: Why They Can Be Useful for High-Dimensional Data
New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,
More informationClassification via kernel regression based on univariate product density estimators
Classification via kernel regression based on univariate product density estimators Bezza Hafidi 1, Abdelkarim Merbouha 2, and Abdallah Mkhadri 1 1 Department of Mathematics, Cadi Ayyad University, BP
More informationNotes on Discriminant Functions and Optimal Classification
Notes on Discriminant Functions and Optimal Classification Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Discriminant Functions Consider a classification problem
More informationGrowing a Large Tree
STAT 5703 Fall, 2004 Data Mining Methodology I Decision Tree I Growing a Large Tree Contents 1 A Single Split 2 1.1 Node Impurity.................................. 2 1.2 Computation of i(t)................................
More informationVariance Reduction and Ensemble Methods
Variance Reduction and Ensemble Methods Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Last Time PAC learning Bias/variance tradeoff small hypothesis
More informationLearning with multiple models. Boosting.
CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models
More informationFoundations of Machine Learning
Introduction to ML Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu page 1 Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about
More informationSupervised Learning Part I
Supervised Learning Part I http://www.lps.ens.fr/~nadal/cours/mva Jean-Pierre Nadal CNRS & EHESS Laboratoire de Physique Statistique (LPS, UMR 8550 CNRS - ENS UPMC Univ. Paris Diderot) Ecole Normale Supérieure
More information