Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts arxiv: v2 [math.st] 23 Jul 2012

Servane Gey
Laboratoire MAP5 - UMR 8145, Université Paris Descartes, Paris Cedex 06, France - Servane.Gey@parisdescartes.fr

July 24, 2012

Abstract

The Vapnik-Chervonenkis (VC) dimension of the set of half-spaces of $\mathbb{R}^d$ with frontiers parallel to the axes is computed exactly. It is shown that it is much smaller than the intuitive value of $d$. A good approximation based on Stirling's formula proves that it is more likely of the order $\log_2 d$. This result may be used to evaluate the performance of classifiers or regressors based on dyadic partitioning of $\mathbb{R}^d$, for instance. Algorithms using axis-parallel cuts to partition $\mathbb{R}^d$ are often used to reduce the computational time of such estimators when $d$ is large.

Keywords: Vapnik-Chervonenkis dimension, axis-parallel cuts.

MSC 2010 Classification: 62G99, 62H99

1 Introduction

The VC dimension of a set of subsets was introduced by Vapnik and Chervonenkis [9, 10] to measure its complexity. The VC dimension of a real-valued function space $\mathcal{F}$ is then the VC dimension of $\{\{x ; f(x) \geq 0\} ; f \in \mathcal{F}\}$. In particular, the VC dimension of sets of classifiers or regressors appears commonly in the statistical learning area when evaluating their performance. For example, Vapnik's theory in the classification framework is now widely known (see [3] for instance): let $(X, Y)$ be a couple of variables taking values in $\mathbb{R}^d \times \{0; 1\}$, and let $L$ be a sample of $n$ independent replications of $(X, Y)$. If $\hat{f}$ is a classifier minimizing the average misclassification rate of $L$ on a set of classifiers having finite VC dimension $V$, then, without further assumption on the distribution $P$ of $(X, Y)$, the performance of $\hat{f}$ is evaluated as follows:

$$\mathbb{E}_L\left[P\left(\hat{f}(X) \neq Y\right)\right] \leq C_1\, \mathrm{bias}^2(\hat{f}) + C_2\, \frac{V}{n}, \qquad (1)$$

where $\mathbb{E}_L$ denotes the expectation with respect to the sample distribution, $\mathrm{bias}(\hat{f})$ denotes the bias of the classifier $\hat{f}$, and $C_1$ and $C_2$ are absolute constants.

Functional estimates defined on partitions of $\mathbb{R}^d$ are often used to estimate relationships between two variables $X \in \mathbb{R}^d$ and $Y \in \{0; 1\}$ or $Y \in \mathbb{R}$ (such as histograms, piecewise polynomials, or splines for example). In many cases, the VC dimension of the set of subsets used to construct the partition appears inside risk bounds when evaluating the performance of such estimators. For example, if the set used is the set of all half-spaces of $\mathbb{R}^d$, its VC dimension $d + 1$ often has to be taken into account. When $d$ is large, it is often computationally easier to construct partitions using axis-parallel cuts. For example, some theoretical developments on dyadic partitions of $\mathbb{R}^2$ are given in [4, 1], and the VC dimension of axis-parallel cuts appears more particularly in the results obtained on the performance of classification and regression binary decision trees (CART), introduced by Breiman et al. [2] in 1984 and theoretically studied in [8, 7, 5, 6].

2 Reminder about VC Dimension

The VC dimension of a set $\mathcal{A}$ of subsets of some measurable space $\mathcal{X}$ is based on counting the number of intersections of $\mathcal{A}$ with a finite set of fixed points in $\mathcal{X}$.

Definition 1 (Vapnik-Chervonenkis Dimension). Let $\mathcal{A}$ be a set of subsets of some measurable space $\mathcal{X}$. Then $(x_1, \dots, x_n) \in \mathcal{X}^n$ will be said to be shattered by $\mathcal{A}$ if all subsets of $\{x_1 ; \dots ; x_n\}$ are covered by $\mathcal{A}$, that is if

$$\left|\left\{\{x_1, \dots, x_n\} \cap A ; A \in \mathcal{A}\right\}\right| = 2^n.$$

The Vapnik-Chervonenkis dimension $VC(\mathcal{A})$ of $\mathcal{A}$ is then defined as the maximal integer $n$ such that there exist $n$ points in $\mathcal{X}$ shattered by $\mathcal{A}$, i.e.

$$VC(\mathcal{A}) = \max\left\{ n ; \max_{(x_1, \dots, x_n) \in \mathcal{X}^n} \left|\left\{\{x_1, \dots, x_n\} \cap A ; A \in \mathcal{A}\right\}\right| = 2^n \right\}.$$

If no such $n$ exists, then $VC(\mathcal{A}) = +\infty$.
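The counting in Definition 1 can be checked by brute force on small examples. The sketch below is only an illustration: the helper names `trace_count`, `is_shattered` and `half_lines` are hypothetical, and the family of half-lines $]-\infty, a]$ on the real line is chosen here because its traces on a finite set are easy to enumerate.

```python
def trace_count(points, family):
    """Number of distinct subsets of `points` cut out by the sets in `family`.

    `family` is an iterable of membership predicates x -> bool, one per set A.
    """
    traces = {frozenset(j for j, x in enumerate(points) if member(x))
              for member in family}
    return len(traces)


def is_shattered(points, family):
    """Definition 1: the n points are shattered when all 2^n subsets occur."""
    return trace_count(points, family) == 2 ** len(points)


def half_lines(points):
    """Finitely many half-lines ]-inf, a] realising every trace on `points`.

    Only the position of a relative to the points matters, so one threshold
    below the minimum plus one threshold at each point is enough.
    """
    thresholds = [min(points) - 1.0] + sorted(points)
    return [lambda x, a=a: x <= a for a in thresholds]


one_point = [0.0]
two_points = [0.0, 1.0]
print(is_shattered(one_point, half_lines(one_point)))    # True
print(is_shattered(two_points, half_lines(two_points)))  # False
```

A single point is shattered by half-lines, whereas for two points the subset containing only the larger one is never obtained, so only 3 of the 4 subsets occur.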

Thus, it is easily seen that the larger $VC(\mathcal{A})$, the more complex $\mathcal{A}$. For example, if $\mathcal{A} = \{\,]-\infty ; x] ; x \in \mathbb{R}\}$, then $VC(\mathcal{A}) = 1$; or if $\mathcal{A}$ is the set of all half-spaces in $\mathbb{R}^d$, then $VC(\mathcal{A}) = d + 1$. Since the set of axis-parallel cuts is a subset of the set of all half-spaces in $\mathbb{R}^d$, it could be natural to think that its VC dimension is of order $d$. Actually, it is shown in what follows that it is of order $\log_2 d$.

3 VC Dimension of axis-parallel cuts

We give a formula to compute the VC dimension of axis-parallel cuts in $\mathbb{R}^d$. Since the obtained formula is not always easy to handle, an approximation is also given.

Lemma 1. Let

$$\mathcal{A}_d = \left\{ \{x \in \mathbb{R}^d ; x^i \leq a\} ; i = 1, \dots, d,\ a \in \mathbb{R} \right\}.$$

Then

$$VC(\mathcal{A}_d) = \max\left\{ n ; \binom{n}{\lfloor n/2 \rfloor} \leq d \right\},$$

where $\lfloor x \rfloor$ denotes the integer part of $x$. Furthermore, the following approximation of $VC(\mathcal{A}_d)$ is available for all $d \geq 2$:

log d log V C(A d) log ( ) d d log 2

Figure 1 shows that $VC(\mathcal{A}_d)$ is a piecewise constant function of the space dimension $d$, which increases at a rate much smaller than the intuitive value of $d$. It also shows that the bounds computed from Stirling's formula are sharp.

Figure 1: $VC(\mathcal{A}_d)$ with respect to the space dimension $d$, and Stirling's bounds.

Proof. The idea is that, to have $n$ points $(x_1, \dots, x_n)$ shattered by $\mathcal{A}_d$, all the subsets of $\{x_1, \dots, x_n\}$ should be covered by $\mathcal{A}_d$. But, if there exists $p \leq n$ such that there are at least $d + 1$ subsets of $\{x_1, \dots, x_n\}$ having $p$ elements, then $\mathcal{A}_d$ will miss at least $\binom{n}{p} - d$ subsets: let $n \geq 1$ and $(x_1, \dots, x_n)$ be $n$ points in $\mathbb{R}^d$. Suppose that $n$ is such that $\binom{n}{\lfloor n/2 \rfloor} > d$. This means that there are at least $d + 1$ subsets of $\{x_1, \dots, x_n\}$ of size $\lfloor n/2 \rfloor$. For each coordinate $i = 1, \dots, d$, let us denote by $x^i_{i(\cdot)}$ the order statistic computed from the $i$-th coordinate of $(x_1, \dots, x_n)$, that is, for all $i = 1, \dots, d$,

$$x^i_{i(1)} \leq x^i_{i(2)} \leq \dots \leq x^i_{i(n)}.$$

Let $p = \lfloor n/2 \rfloor$ and let

$$\mathcal{B}_p = \left\{ \{x_{i(1)} ; \dots ; x_{i(p)}\} ; i = 1, \dots, d \text{ and } |\{x_{i(1)} ; \dots ; x_{i(p)}\}| = p \right\},$$

$$\mathcal{B}_p^c = \left\{ B \subset \{x_1, \dots, x_n\} ; |B| = p \text{ and } B \notin \mathcal{B}_p \right\}.$$

Hence $\mathcal{B}_p$ is covered by $\mathcal{A}_d$ (by simply taking $A = \{x^i \leq (x^i_{i(p)} + x^i_{i(p+1)})/2\}$ for each coordinate), and we have that

$$|\mathcal{B}_p| \leq d \quad \text{and} \quad |\mathcal{B}_p^c| \geq \binom{n}{p} - d > 0.$$

Let $B \in \mathcal{B}_p^c$ and $A = \{x^i \leq a\} \in \mathcal{A}_d$. If $|\{x_1, \dots, x_n\} \cap A| \neq p$, then $\{x_1, \dots, x_n\} \cap A \neq B$. Else, since $\{x_1, \dots, x_n\} \cap A = \{x_j ; x^i_j \leq a\}$, we have that $x^i_{i(j)} \leq a$ for all $j = 1, \dots, p$, and $x^i_{i(j)} > a$ for all $j = p + 1, \dots, n$. So $\{x_1, \dots, x_n\} \cap A = \{x_{i(1)} ; \dots ; x_{i(p)}\}$ and $|\{x_{i(1)} ; \dots ; x_{i(p)}\}| = p$, leading to $\{x_1, \dots, x_n\} \cap A \in \mathcal{B}_p$, and then to $\{x_1, \dots, x_n\} \cap A \neq B$. So, for all $B \in \mathcal{B}_p^c$
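and all $A \in \mathcal{A}_d$, $\{x_1, \dots, x_n\} \cap A \neq B$. So, if $\binom{n}{\lfloor n/2 \rfloor} > d$, $(x_1, \dots, x_n)$ cannot be shattered by $\mathcal{A}_d$. Thus

$$VC(\mathcal{A}_d) \leq \max\left\{ n ; \binom{n}{\lfloor n/2 \rfloor} \leq d \right\}.$$

Let $n$ be such that $\binom{n}{\lfloor n/2 \rfloor} \leq d$. Let $(x_1, \dots, x_n)$ be $n$ points of $\mathbb{R}^d$ defined as follows: for each coordinate $i = 1, \dots, \binom{n}{\lfloor n/2 \rfloor}$, let $\{i_1 ; \dots ; i_{\lfloor n/2 \rfloor}\}$ be the $i$-th subset of indices in $\{1 ; \dots ; n\}$, where the indices are denoted in ascending order, i.e. $i_1 < \dots < i_{\lfloor n/2 \rfloor}$. Since $\binom{n}{\lfloor n/2 \rfloor} \leq d$, we obtain $\binom{n}{\lfloor n/2 \rfloor}$ distinct subsets of indices. Hence we take for each such coordinate $x^i_{i_k} = k$.

Then the remaining values of $(x_1, \dots, x_n)$ are taken as follows. Since $\binom{n}{\lfloor n/2 \rfloor} \leq d$, for each subset $\{i'_1 ; \dots ; i'_{\lfloor n/2 \rfloor + 1}\}$ of $\{1 ; \dots ; n\}$ with $\lfloor n/2 \rfloor + 1$ elements, there exists $i \in \{1 ; \dots ; \binom{n}{\lfloor n/2 \rfloor}\}$ such that $\{i_1 ; \dots ; i_{\lfloor n/2 \rfloor}\} = \{i'_1 ; \dots ; i'_{\lfloor n/2 \rfloor}\}$. Then take $x^i_{i'_{\lfloor n/2 \rfloor + 1}} = \lfloor n/2 \rfloor + 1$. Let us note that, if $n$ is odd, there is a bijection between $i$ and $i'$. Let $\{j_1 ; \dots ; j_m\} = \{j \notin \{i'_1 ; \dots ; i'_{\lfloor n/2 \rfloor + 1}\}\}$, with $j_1 < \dots < j_m$, and let $j_0 = i'_{\lfloor n/2 \rfloor + 1}$. Then take $x^i_{j_k} = x^i_{j_{k-1}} + 1$. If not filled, the last coordinates are set to be equal to $n$. Hence, we obtain that, for all $j \notin \{i_1 ; \dots ; i_{\lfloor n/2 \rfloor}\}$, $x^i_j \geq \lfloor n/2 \rfloor + 1$.

Then $(x_1, \dots, x_n)$ is shattered by $\mathcal{A}_d$: for $p \in \{0 ; \dots ; n\}$, let $B = \{x_{i_1} ; \dots ; x_{i_p}\} \subset \{x_1, \dots, x_n\}$, with $1 \leq i_1 < i_2 < \dots < i_p \leq n$ as soon as $p \neq 0$.

If $p = 0$, let $i_0 = \operatorname{argmin}_{1 \leq i \leq d} \min_j x^i_j$, and take $A = \{x^{i_0} \leq \min_j x^{i_0}_j - 1\}$. Then $B = \{x_1, \dots, x_n\} \cap A = \emptyset$.

If $p = n$, let $i_n = \operatorname{argmax}_{1 \leq i \leq d} \max_j x^i_j$,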
and take $A = \{x^{i_n} \leq \max_j x^{i_n}_j + 1\}$. Then $B = \{x_1, \dots, x_n\} \cap A = \{x_1, \dots, x_n\}$.

If $0 < p \leq \lfloor n/2 \rfloor$, let $A \in \mathcal{A}_d$ be the subset defined by $A = \{x^i \leq p + 1/2\}$, with $i$ the coordinate corresponding to a subset of indices $\{i_1 ; \dots ; i_{\lfloor n/2 \rfloor}\}$ containing $\{i_1 ; \dots ; i_p\}$. Then, by definition of $(x^i_1, \dots, x^i_n)$, $B = \{x_1, \dots, x_n\} \cap A$.

If $\lfloor n/2 \rfloor + 1 \leq p < n$, let $i$ be the coordinate corresponding to the configuration $\{i_1 ; \dots ; i_{\lfloor n/2 \rfloor + 1}\}$ (as defined by $(x_1, \dots, x_n)$). Let $A \in \mathcal{A}_d$ be the subset defined by $A = \{x^i \leq p + 1/2\}$. Then, by definition of $(x^i_1, \dots, x^i_n)$, $B = \{x_1, \dots, x_n\} \cap A$. Thus

$$VC(\mathcal{A}_d) \geq \max\left\{ n ; \binom{n}{\lfloor n/2 \rfloor} \leq d \right\}.$$

Then, the lower and upper bounds of $VC(\mathcal{A}_d)$ are computed by using Stirling's formula: for all $n \geq 1$ we have

$$\sqrt{2\pi}\, e^{-(n+1)} (n+1)^{n+\frac{1}{2}} \leq n! \leq \sqrt{2\pi}\, e^{-(n+1)}\, e^{\frac{1}{12(n+1)}} (n+1)^{n+\frac{1}{2}}.$$

A simple calculation gives the following: if n is even, then n/2 and if n is odd, then Thus, if then d. V C(A d ). On the other hand, if n/2 e 2(n+) e 2(n+) e n+ (n + )n+/2 2 2π e 2π 2 36 (n + 2) n+ e+ 2 n+, 6π n+ (n + )(n+)/2 (n + 3) e π 2 n+ d, n/2+ e+ 24 2π 2 n. Taking the logarithm leads to the lower bound of d, we have that, if n is even, e e 3n+6 n+ (n + )n+/2 2 2π 2 (n + 2) n+ e 2 n+, 2π d + 2
and if n is odd, ( n ) e n+2 e 3(n+3)(n+) 2 2π n+ (n + )(n+)/2 (n + 3) e 8 n/2+ 2 n+. 2π d + 3 e 8 Thus, since, for all n such that d, 2 n+ d, the upper 2π bound of V C(A d ) is found by taking the logarithm of this last expression.

References

[1] Akakpo, N. Adaptation to anisotropy and inhomogeneity via dyadic piecewise polynomial selection. Mathematical Methods of Statistics 21, 1 (2012), 1-28.
[2] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. Classification And Regression Trees. Chapman & Hall, 1984.
[3] Devroye, L., Györfi, L., and Lugosi, G. A probabilistic theory of pattern recognition, vol. 31 of Applications of Mathematics (New York). Springer-Verlag, New York, 1996.
[4] Donoho, D. L. CART and best-ortho-basis: A connection. The Annals of Statistics 25, 5 (1997).
[5] Gey, S. Risk bounds for CART classifiers under a margin condition. Pattern Recognition 45 (2012).
[6] Gey, S., and Mary-Huard, T. Risk bounds for embedded variable selection in classification trees. Tech. rep., arXiv, v1, 2011.
[7] Gey, S., and Nedelec, E. Model selection for CART regression trees. IEEE Trans. Inform. Theory 51, 2 (2005).
[8] Nobel, A. B. Analysis of a complexity-based pruning scheme for classification trees. IEEE Trans. Inform. Theory 48, 8 (2002).
[9] Vapnik, V. N., and Chervonenkis, A. Y. Theory of uniform convergence of frequencies of events to their probabilities and problems of search for an optimal solution from empirical data. Avtomat. i Telemeh., 2 (1971).
[10] Vapnik, V. N., and Chervonenkis, A. Y. Teoriya raspoznavaniya obrazov. Statisticheskie problemy obucheniya. Izdat. Nauka, Moscow.
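As a numerical illustration of the exact formula of the Lemma, $VC(\mathcal{A}_d) = \max\{n ; \binom{n}{\lfloor n/2 \rfloor} \leq d\}$, the following minimal sketch evaluates it for a few dimensions and checks two small shattered configurations by brute force. The function names `vc_axis_parallel` and `axis_cut_traces` are hypothetical, and the point configurations are illustrative choices for $d = 2$ and $d = 3$, not the general construction used in the proof.

```python
from math import comb


def vc_axis_parallel(d):
    """Largest n such that C(n, floor(n/2)) <= d, for an integer d >= 1."""
    if d < 1:
        raise ValueError("d must be a positive integer")
    n = 1
    # C(n, floor(n/2)) is non-decreasing in n, so stop at the first failure.
    while comb(n + 1, (n + 1) // 2) <= d:
        n += 1
    return n


def axis_cut_traces(points):
    """Index sets {j : x_j^i <= a} obtained from all cuts {x^i <= a}."""
    n, d = len(points), len(points[0])
    traces = {frozenset()}  # any a below every coordinate value gives the empty set
    for i in range(d):
        for a in {p[i] for p in points}:
            traces.add(frozenset(j for j in range(n) if points[j][i] <= a))
    return traces


for d in (1, 2, 3, 6, 10, 1000):
    print(d, vc_axis_parallel(d))

# Two points in R^2 and three points in R^3 (illustrative configurations).
print(len(axis_cut_traces([(1, 2), (2, 1)])))                 # 4 = 2^2
print(len(axis_cut_traces([(1, 3, 2), (2, 1, 3), (3, 2, 1)])))  # 8 = 2^3
```

For these dimensions the formula gives 1, 2, 3, 4, 5 and 12 respectively; for $d = 1000$ this is close to $\log_2 1000 \approx 9.97$ and far below $d$.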


More information

TREE-STRUCTURED REGRESSION AND THE DIFFERENTIATION OF INTEGRALS 1. BY RICHARD A. OLSHEN Stanford University. In Memory of Leo Breiman,

TREE-STRUCTURED REGRESSION AND THE DIFFERENTIATION OF INTEGRALS 1. BY RICHARD A. OLSHEN Stanford University. In Memory of Leo Breiman, The Annals of Statistics 2007, Vol. 35, No. 1, 1 12 DOI: 10.1214/009053606000001000 Institute of Mathematical Statistics, 2007 TREE-STRUCTURED REGRESSION AND THE DIFFERENTIATION OF INTEGRALS 1 BY RICHARD

More information

A Study of Relative Efficiency and Robustness of Classification Methods

A Study of Relative Efficiency and Robustness of Classification Methods A Study of Relative Efficiency and Robustness of Classification Methods Yoonkyung Lee* Department of Statistics The Ohio State University *joint work with Rui Wang April 28, 2011 Department of Statistics

More information

CSE 648: Advanced algorithms

CSE 648: Advanced algorithms CSE 648: Advanced algorithms Topic: VC-Dimension & ɛ-nets April 03, 2003 Lecturer: Piyush Kumar Scribed by: Gen Ohkawa Please Help us improve this draft. If you read these notes, please send feedback.

More information

On threshold-based classification rules By Leila Mohammadi and Sara van de Geer. Mathematical Institute, University of Leiden

On threshold-based classification rules By Leila Mohammadi and Sara van de Geer. Mathematical Institute, University of Leiden On threshold-based classification rules By Leila Mohammadi and Sara van de Geer Mathematical Institute, University of Leiden Abstract. Suppose we have n i.i.d. copies {(X i, Y i ), i = 1,..., n} of an

More information

Curve learning. p.1/35

Curve learning. p.1/35 Curve learning Gérard Biau UNIVERSITÉ MONTPELLIER II p.1/35 Summary The problem The mathematical model Functional classification 1. Fourier filtering 2. Wavelet filtering Applications p.2/35 The problem

More information

Harrison B. Prosper. Bari Lectures

Harrison B. Prosper. Bari Lectures Harrison B. Prosper Florida State University Bari Lectures 30, 31 May, 1 June 2016 Lectures on Multivariate Methods Harrison B. Prosper Bari, 2016 1 h Lecture 1 h Introduction h Classification h Grid Searches

More information

Performance of Cross Validation in Tree-Based Models

Performance of Cross Validation in Tree-Based Models Performance of Cross Validation in Tree-Based Models Seoung Bum Kim, Xiaoming Huo, Kwok-Leung Tsui School of Industrial and Systems Engineering Georgia Institute of Technology Atlanta, Georgia 30332 {sbkim,xiaoming,ktsui}@isye.gatech.edu

More information

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016

Probabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016 Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier

More information

arxiv: v1 [cs.ds] 28 Sep 2018

arxiv: v1 [cs.ds] 28 Sep 2018 Minimization of Gini impurity via connections with the k-means problem arxiv:1810.00029v1 [cs.ds] 28 Sep 2018 Eduardo Laber PUC-Rio, Brazil laber@inf.puc-rio.br October 2, 2018 Abstract Lucas Murtinho

More information

Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview. Introduction to ML. Marek Petrik 4/25/2017 Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

More information

ECE521 Lecture7. Logistic Regression

ECE521 Lecture7. Logistic Regression ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

More information

SUPPORT VECTOR MACHINE

SUPPORT VECTOR MACHINE SUPPORT VECTOR MACHINE Mainly based on https://nlp.stanford.edu/ir-book/pdf/15svm.pdf 1 Overview SVM is a huge topic Integration of MMDS, IIR, and Andrew Moore s slides here Our foci: Geometric intuition

More information

The Vapnik-Chervonenkis Dimension

The Vapnik-Chervonenkis Dimension The Vapnik-Chervonenkis Dimension Prof. Dan A. Simovici UMB 1 / 91 Outline 1 Growth Functions 2 Basic Definitions for Vapnik-Chervonenkis Dimension 3 The Sauer-Shelah Theorem 4 The Link between VCD and

More information

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015

Machine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015 Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and

More information

Constructing Prediction Intervals for Random Forests

Constructing Prediction Intervals for Random Forests Senior Thesis in Mathematics Constructing Prediction Intervals for Random Forests Author: Benjamin Lu Advisor: Dr. Jo Hardin Submitted to Pomona College in Partial Fulfillment of the Degree of Bachelor

More information

CSCI 5622 Machine Learning

CSCI 5622 Machine Learning CSCI 5622 Machine Learning DATE READ DUE Mon, Aug 31 1, 2 & 3 Wed, Sept 2 3 & 5 Wed, Sept 9 TBA Prelim Proposal www.rodneynielsen.com/teaching/csci5622f09/ Instructor: Rodney Nielsen Assistant Professor

More information

Support Vector Machines with a Reject Option

Support Vector Machines with a Reject Option Support Vector Machines with a Reject Option Yves Grandvalet, 2, Alain Rakotomamonjy 3, Joseph Keshet 2 and Stéphane Canu 3 Heudiasyc, UMR CNRS 6599 2 Idiap Research Institute Université de Technologie

More information

Escaping the curse of dimensionality with a tree-based regressor

Escaping the curse of dimensionality with a tree-based regressor Escaping the curse of dimensionality with a tree-based regressor Samory Kpotufe UCSD CSE Curse of dimensionality In general: Computational and/or prediction performance deteriorate as the dimension D increases.

More information

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM

Support Vector Machines (SVM) in bioinformatics. Day 1: Introduction to SVM 1 Support Vector Machines (SVM) in bioinformatics Day 1: Introduction to SVM Jean-Philippe Vert Bioinformatics Center, Kyoto University, Japan Jean-Philippe.Vert@mines.org Human Genome Center, University

More information

Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities ; CU- CS

Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities ; CU- CS University of Colorado, Boulder CU Scholar Computer Science Technical Reports Computer Science Spring 5-1-23 Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities

More information

Boosting Methods: Why They Can Be Useful for High-Dimensional Data

Boosting Methods: Why They Can Be Useful for High-Dimensional Data New URL: http://www.r-project.org/conferences/dsc-2003/ Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003) March 20 22, Vienna, Austria ISSN 1609-395X Kurt Hornik,

More information

Classification via kernel regression based on univariate product density estimators

Classification via kernel regression based on univariate product density estimators Classification via kernel regression based on univariate product density estimators Bezza Hafidi 1, Abdelkarim Merbouha 2, and Abdallah Mkhadri 1 1 Department of Mathematics, Cadi Ayyad University, BP

More information

Notes on Discriminant Functions and Optimal Classification

Notes on Discriminant Functions and Optimal Classification Notes on Discriminant Functions and Optimal Classification Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Discriminant Functions Consider a classification problem

More information

Growing a Large Tree

Growing a Large Tree STAT 5703 Fall, 2004 Data Mining Methodology I Decision Tree I Growing a Large Tree Contents 1 A Single Split 2 1.1 Node Impurity.................................. 2 1.2 Computation of i(t)................................

More information

Variance Reduction and Ensemble Methods

Variance Reduction and Ensemble Methods Variance Reduction and Ensemble Methods Nicholas Ruozzi University of Texas at Dallas Based on the slides of Vibhav Gogate and David Sontag Last Time PAC learning Bias/variance tradeoff small hypothesis

More information

Learning with multiple models. Boosting.

Learning with multiple models. Boosting. CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

More information

Foundations of Machine Learning

Foundations of Machine Learning Introduction to ML Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu page 1 Logistics Prerequisites: basics in linear algebra, probability, and analysis of algorithms. Workload: about

More information

Supervised Learning Part I

Supervised Learning Part I Supervised Learning Part I http://www.lps.ens.fr/~nadal/cours/mva Jean-Pierre Nadal CNRS & EHESS Laboratoire de Physique Statistique (LPS, UMR 8550 CNRS - ENS UPMC Univ. Paris Diderot) Ecole Normale Supérieure

More information