Estimating the sample complexity of a multi-class discriminant model
Yann Guermeur
LIP6, UMR CNRS, Université Paris 6, 4, place Jussieu, 75252 Paris cedex 05
Yann.Guermeur@lip6.fr

André Elisseeff and Hélène Paugam-Moisy
ERIC, Université Lumière Lyon 2, 5, avenue Pierre Mendès-France, Bron cedex
{aelissee,hpaugam}@univ-lyon2.fr

Abstract

We study the generalization performance of a multi-class discriminant model. Several bounds on its sample complexity are derived from uniform convergence results based on different measures of capacity. This gives us an insight into the nature of the capacity measure best suited to the study of multi-class discriminant models.

1 Introduction

Since the pioneering work of Vapnik and Chervonenkis [14], extending the classical Glivenko-Cantelli theorem to give a uniform convergence result over classes of indicator functions, many studies have dealt with uniform strong laws of large numbers (see for instance [10, 7]). The bounds they provide, both in pattern recognition and regression estimation, can readily be used to derive a sample complexity which is an increasing function of a particular measure of the capacity of the model considered. The choice of the appropriate bound, as well as the computation of a tight upper bound on the capacity measure, thus appear to be of central importance in deriving a tight bound on the sample complexity. This is the subject we investigate in this paper, through an application to the Multivariate Linear Regression (MLR) combiner described in [6]. Up to now, very few studies in statistical learning theory have dealt with multi-class discrimination. Section 2 briefly outlines the implementation of the MLR model for the combination of class posterior probability estimates. Section 3 is devoted to the estimation of the sample complexity using two combinatorial quantities generalizing the Vapnik-Chervonenkis (VC) dimension, called the graph dimension and the Natarajan dimension.
In Section 4, another bound is derived from a theorem in which the capacity of the family of functions is characterized by its covering number. Both bounds are discussed in Section 5, and further improvements are proposed as perspectives.

2 MLR combiner for classifier combination

We consider a Q-category discrimination task, under the usual hypothesis that there is a joint distribution, fixed but unknown, on S = X × Y, where X is the input space and Y the set of categories. We further assume that, for each input pattern x ∈ X, the outputs of P classifiers are available. Let f_j denote the function computed by the j-th of these classifiers: f_j(x) = [f_jk(x)] ∈ R^Q. The k-th output f_jk(x) approximates the class posterior probability p(C_k | x). Precisely, f_j(x) ∈ U with

U = { u ∈ R_+^Q : 1_Q^T u = 1 }
In other words, the outputs are non-negative and sum to 1. Let F(x) = [f_j(x)], (1 ≤ j ≤ P) (F(x) ∈ U^P) be the vector of predictors. The MLR model studied here, parameterized by v = [v_k] ∈ R^(PQ²), computes the functions g ∈ G given by:

g(x) = [g_1(x), …, g_k(x), …, g_Q(x)]^T = [v_1^T F(x), …, v_k^T F(x), …, v_Q^T F(x)]^T

Let Λ be the set of loss functions satisfying the general conditions for outputs to be interpreted as probabilities (see for instance [3]) and s an N-sample of observations (s ∈ S^N). Among the functions of G, the MLR combiner is any of the functions which constitutes a solution to the following optimization problem:

Problem 1 Given a convex loss function L ∈ Λ and an N-sample s ∈ S^N, find a function in G minimizing the empirical risk Ĵ(v) and taking its values in U.

The MLR combiner is thus designed to take as inputs class posterior probability estimates and to output better estimates with respect to some given criterion (least squares, cross-entropy, …). v_{k,l,m}, the general term of v_k, is the coefficient associated with the predictor f_lm(x) in the regression computed to estimate p(C_k | x). Let v be the vector of all the parameters (v = [v_k] ∈ R^(PQ²)). As was pointed out in [6], optimal solutions to Problem 1 are obtained by minimizing Ĵ(v) subject to v ∈ V with

V = { v ∈ R_+^(PQ²) : ∀(l, m), Σ_k (v_{k,l,m} − v_{k,l,1}) = 0, 1^T v = Q }

This result holds irrespective of the choice of L and s. Details on the computation of a global minimum can be found in [5]. To sum up, the model we study is a multiple-output perceptron, under the constraints ∀x ∈ X, F(x) ∈ U^P and v ∈ V. One can easily verify that the Euclidean norm ‖F(x)‖ of every vector in U^P is bounded above by √P. Furthermore, solving a simple quadratic programming problem establishes that for all v ∈ V and k ∈ {1, …, Q}, ‖v_k‖ ≤ √Q.

3 Growth function bounds

From now on, we consider the MLR combiner as a discriminant model, by application of Bayes' estimated decision rule.
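To make the model concrete, here is a minimal numerical sketch (not from the paper; sizes Q = 3 and P = 4 are illustrative, and the normalization of the weight matrix below is a crude stand-in for the constraint set V, not the exact feasible set). It builds a vector F(x) of P simplex-valued classifier outputs and applies g(x) = [v_k^T F(x)]:

```python
import numpy as np

rng = np.random.default_rng(0)
Q, P = 3, 4          # Q categories, P base classifiers (illustrative sizes)

def base_outputs():
    """Simulate the P classifier outputs f_j(x), each a point of the simplex U."""
    raw = rng.random((P, Q))
    return (raw / raw.sum(axis=1, keepdims=True)).ravel()  # F(x) in U^P, dim QP

# One weight vector v_k per category; columns scaled so that g(x) lands in U.
# (Stand-in normalization, not the paper's constraint set V.)
V = rng.random((Q, Q * P))
V /= V.sum(axis=0, keepdims=True) * P    # each column now sums to 1/P

F_x = base_outputs()
g_x = V @ F_x                            # g_k(x) = v_k^T F(x)
print(g_x, g_x.sum())                    # non-negative entries summing to 1
```

Since each of the P blocks of F(x) sums to 1 and each column of the scaled V sums to 1/P, the combined output g(x) is itself a probability vector, as required by Problem 1.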
Although the learning process described in the previous section only amounts to estimating probability densities on a finite sample, the criterion of interest, the generalization ability in terms of recognition rate, can be both estimated and controlled. To perform this task, we use results derived in the framework of computational learning theory and the statistical learning theory introduced by Vapnik. Roughly speaking, they are corollaries of uniform strong laws of large numbers. The theorems used in this section are derived from bounds grounded on a combinatorial capacity measure called the growth function. In order to state them, we must first introduce additional notations and definitions. Let H be a family of functions from X to a finite set Y. E_s(h) and E(h) respectively designate the error on s (observed error) and the generalization error of a function h belonging to H.

Definition 1 Let H be a set of indicator functions (values in {0, 1}). Let s_X be an N-sample of X and Δ_H(s_X) the number of different classifications of s_X by the functions of H. The growth function Δ_H is defined by:

Δ_H(N) = max { Δ_H(s_X) : s_X ∈ X^N }

Definition 2 The VC dimension of a set H of indicator functions is the maximum number d of vectors that can be shattered, i.e. separated into two classes in all 2^d possible ways, using functions of H. If this maximum does not exist, the VC dimension is infinite.
These definitions only apply to sets of indicator functions. The following theorem, dealing with multiple-output functions, appears as an immediate corollary of a result due to Vapnik [12] (see for instance [13]):

Theorem 1 Let s ∈ S^N. With probability 1 − δ,

E(h) < E_s(h) + √( (1/N) (ln Δ_GH(N) − ln δ) ) + 1/N

GH, the graph space of H, is defined as follows: for h ∈ H, let Gh be the function from X × Y to {0, 1} defined by Gh(x, y) = 1 ⟺ h(x) = y, and GH = {Gh : h ∈ H}. Δ_GH is the growth function of GH [9]. Thus, in order to obtain an upper bound on the confidence interval in Theorem 1, one must find an upper bound on the growth function associated with the MLR combiner. This constitutes the object of the next two subsections, where two different bounds are stated.

3.1 Graph dimension of the MLR combiner

The discriminant functions computed by the MLR combiner are also computed by the single-hidden-layer perceptron with threshold activation functions depicted in Figure 1. Several articles have been devoted to bounding the growth function and VC dimension of multilayer perceptrons [9, 11]. Proceeding as in [11], one can use as an upper bound on the growth function the product of the bounds on the growth functions of the individual hidden units derived from Sauer's lemma. Since the VC dimension of each hidden unit is at most equal to the dimension of the smallest subspace of R^(QP) containing U^P, and this dimension is d_E = P(Q − 1) + 1 [6], we get:

Δ_GMLP(N) < (eN / d_E)^(d_E Q(Q−1)/2)    (1)

Figure 1: Architecture of a multi-layer perceptron with threshold units computing the same discriminant function as the MLR combiner. Each hidden unit computes a function h_{i,j}(F(x)) = t(F(x)^T (v_i − v_j)), (1 ≤ i < j ≤ Q), where t(z) = 1 if z > 0 and t(z) = −1 otherwise. The weights of the output layer, either +1 (solid lines) or −1 (dashed lines), and the biases are chosen so that the output units compute a logical AND.
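The size of the resulting confidence interval is easy to evaluate numerically. The sketch below (illustrative only; it plugs the reconstructed growth-function bound of equation (1) into the Theorem 1 interval, with made-up values of N, Q, P and δ) shows how the term shrinks with the sample size:

```python
import math

def theorem1_interval(N, Q, P, delta=0.05):
    """Confidence term of Theorem 1 with ln(growth function) replaced by the
    bound of equation (1): (d_E * Q(Q-1)/2) * ln(eN / d_E). Sketch only."""
    d_E = P * (Q - 1) + 1                 # dimension of the span of U^P
    n_units = Q * (Q - 1) // 2            # one threshold unit per pair of categories
    ln_growth = n_units * d_E * math.log(math.e * N / d_E)
    return math.sqrt((ln_growth - math.log(delta)) / N) + 1.0 / N

# The interval decreases as the sample size N grows.
print(theorem1_interval(10**3, 3, 4), theorem1_interval(10**6, 3, 4))
```

Even for modest Q and P, the interval only becomes small for very large N, which anticipates the looseness discussed in Section 5.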
The number below each layer corresponds to the number of units. This bound can be significantly improved by making use of the dependences between the hidden units. For lack of space, we only exhibit here a simple way to do so. Let x ∈ s_X. Among all the possible classifications performed by the hidden unit whose weight vector is v_1 − v_2, exactly half of them associate to F(x) the value 1. The same is true for the hidden unit whose weight vector is v_2 − v_3. Now, if these two units provide the same output for F(x), then the output of the hidden unit whose weight vector is v_1 − v_3 is known. Consequently, the growth function associated with these three units is at most (eN / d_E)^(2 d_E). Proceeding step by step, we thus get an improved bound:

Δ_GMLP(N) < (eN / d_E)^(d_E (Q−1))    (2)
3.2 Natarajan dimension

Natarajan introduced in [9] an extension of the VC dimension to multiple-output functions, based on the following definition of shattering:

Definition 3 A set H of discrete-valued functions shatters a set s_X of vectors ⟺ there exist two functions h_1 and h_2 belonging to H such that: (a) for any x ∈ s_X, h_1(x) ≠ h_2(x); (b) for all s_1 ⊆ s_X, there exists h_3 ∈ H such that h_3 agrees with h_1 on s_1 and with h_2 on s_X \ s_1, i.e. ∀x ∈ s_1, h_3(x) = h_1(x), and ∀x ∈ s_X \ s_1, h_3(x) = h_2(x).

We established the following result in [6]:

Theorem 2 The Natarajan dimension d_N of the MLR combiner satisfies:

P + ⌊(Q−1)/2⌋ ≤ d_N ≤ Q(d_E − 1)

In [8], the following theorem was proved:

Theorem 3 Let N_c^H(N) be the number of different classifications performed by a set of functions H on a set of size N. Let d_N be the Natarajan dimension of H. Then, for all d ≥ d_N:

N_c^H(N) ≤ Σ_{i=0}^{d} C(N, i) (Q(Q−1)/2)^i

We have the following inequality:

Σ_{i=0}^{d} C(N, i) (Q(Q−1)/2)^i ≤ (eN Q(Q−1) / (2d))^d    (3)

Obviously, Δ_GH(N) ≤ N_c^H(N). Thus, substituting the upper bound on d_N provided by Theorem 2 for d in the formula of Theorem 3 and applying (3), we get:

Δ_GMLP(N) < (eN (Q−1) / (2 (d_E − 1)))^(Q (d_E − 1))    (4)

This last bound is very similar to those provided by (1) and (2). This is a good indication that these bounds are quite loose, since bounding Δ_GH(N) by N_c^H(N) is crude. In fact, the difficulty with the use of the growth function lies in the way the specificity of the model (the links between the hidden units) is taken into account when applying a generalization of Sauer's lemma. This difficulty concerns all approaches using combinatorial methods which do not consider how the model is built. This led us to consider an alternative definition of the capacity of the combiner, which yields an original method to obtain confidence bounds for multi-class discriminant models based on the computation of the a posteriori probabilities p(C_k | x).
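The combinatorial quantities appearing in Theorem 3 and inequality (3) are easy to evaluate. The sketch below (sizes N, d, Q are illustrative) computes the Sauer-type sum and checks it against a closed-form bound of the type (eN · C(Q,2) / d)^d:

```python
import math

def sauer_sum(N, d, Q):
    """Sum_{i=0}^{d} C(N, i) * C(Q, 2)^i  -- the count of Theorem 3."""
    pairs = Q * (Q - 1) // 2
    return sum(math.comb(N, i) * pairs ** i for i in range(d + 1))

def closed_form_bound(N, d, Q):
    """Closed-form bound (e * N * C(Q,2) / d)^d, valid for d <= N, C(Q,2) >= 1."""
    pairs = Q * (Q - 1) // 2
    return (math.e * N * pairs / d) ** d

N, d, Q = 50, 3, 4
print(sauer_sum(N, d, Q))                       # -> 4278001
assert sauer_sum(N, d, Q) <= closed_form_bound(N, d, Q)
```

The gap between the exact sum and the closed-form bound (here roughly a factor of five) is one source of slack in (4), on top of the crude step Δ_GH(N) ≤ N_c^H(N).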
4 Covering number bounds

Many studies have been devoted to giving rates of uniform convergence of means to their expectations, based on a capacity measure called the covering number [13, 10, 7]. To define this measure, we first introduce the notion of ε-cover in a pseudo-metric space.

Definition 4 Let (E, ρ) be a pseudo-metric space. A finite set T ⊆ E is an ε-cover of a set H ⊆ E with respect to ρ if for all h ∈ H there is a h̄ ∈ T such that ρ(h, h̄) ≤ ε.

Definition 5 Given ε and ρ, the covering number N(ε, H, ρ) of H is the size of the smallest ε-cover of H.

Covering numbers were introduced in learning theory to derive bounds for regression estimation problems. In [2], Bartlett has extended their use to discrimination, specifically to dichotomy computation. From now on, H is supposed to be a set of functions taking their values in R. The discriminant function associated with any of the functions h ∈ H is the function t ∘ h, where t is the sign function defined in the caption of Figure 1. E(h) is defined accordingly.
Definition 6 Let π_γ : R → [−γ, γ] be the piecewise-linear function such that:

π_γ(x) = γ t(x) if |x| ≥ γ,  π_γ(x) = x otherwise

Let us define π_γ(H) = {π_γ(h) : h ∈ H} and N_∞(γ, H, N) = max_{s_X} N(γ, π_γ(H), ‖·‖_∞^{s_X}), where ‖h‖_∞^{s_X} = max_{1 ≤ i ≤ N} |h(x_i)|. Let the empirical error according to the margin γ be defined as:

Definition 7 E_s^γ(h) = (1/N) |{(x_i, y_i) ∈ s : h(x_i) y_i < γ}|

We have:

Theorem 4 Suppose γ > 0 and 0 < δ < 1/2. Then, with probability at least 1 − δ,

E(h) ≤ E_s^γ(h) + √( (2/N) ln( 2 N_∞(γ/2, π_γ(H), 2N) / δ ) )

The MLR model takes its values in [0, 1]^Q. For each pattern x, the desired output y is the canonical coding of the category of x. If |g_k(x) − y_k| < 1/2 for all k, then the use of Bayes' estimated decision rule will provide the correct classification of x. A contrario, an error can occur only if |g_k(x) − y_k| ≥ 1/2 for at least one k. For all k, we define E(g_k) as being the error for g_k − 1/2, and E(g) as the generalization error of the discriminant function associated with g. Then, we have:

E(g) ≤ Σ_{k=1}^{Q} E(g_k)    (5)

As in the previous section, bounding the generalization ability of the MLR combiner amounts to bounding a measure of capacity. This time, however, this measure is defined for each individual function g_k. The end of this section thus deals with stating bounds on N_∞(γ/2, π_γ(H), N), where H equals {g_k − 1/2} for any k in {1, …, Q}. Since the function π_γ is 1-Lipschitz,

N_∞(γ/2, π_γ(H), N) ≤ N_∞(γ/2, H, N)

Let T be the affine application defined over B_P (the unit ball of R^(QP) endowed with ‖·‖_2) as:

T : B_P → R^N,  w ↦ √Q [F(x_1)^T w, …, F(x_N)^T w]^T − (1/2) 1_N

It has the property that {[F(x_1)^T w − 1/2, …, F(x_N)^T w − 1/2]^T s.t. ‖w‖ ≤ √Q} ⊆ T(B_P). Hence, since ‖v_k‖ ≤ √Q,

N_∞(γ/2, H, N) ≤ N(γ/2, T(B_P), ‖·‖_∞)    (6)

According to [4], we have:

N(γ/2, T(B_P), ‖·‖_∞) ≤ (8 ‖T̃‖ / γ)^(QP)    (7)

where T̃ is the linear operator associated with T and ‖T̃‖ = sup_{w ∈ B_P} ‖T̃w‖ / ‖w‖. Since ‖F(x)‖ ≤ √P, from the Cauchy-Schwarz inequality, ‖T̃‖ is bounded above by √(QP).
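The margin-based empirical error of Definition 7 is straightforward to compute. A minimal sketch (the scores and labels below are made up for illustration; labels are taken in {−1, +1} as in the dichotomy setting of [2]):

```python
import numpy as np

def margin_error(scores, labels, gamma):
    """E_s^gamma(h): fraction of samples whose margin h(x_i) * y_i is below gamma."""
    margins = np.asarray(scores, dtype=float) * np.asarray(labels, dtype=float)
    return float(np.mean(margins < gamma))

scores = [0.9, 0.2, -0.4, 0.05]   # real-valued outputs h(x_i)
labels = [1, 1, -1, -1]           # desired labels y_i in {-1, +1}
print(margin_error(scores, labels, 0.0),   # -> 0.25, plain misclassification rate
      margin_error(scores, labels, 0.3))   # -> 0.5, stricter margin-gamma error
```

Increasing γ can only increase the empirical term, but it decreases the covering number N_∞(γ/2, π_γ(H), 2N) in Theorem 4, which is the trade-off the margin parameter controls.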
By injection of (7) in (6), and by applying Theorem 4 to bound the right-hand side of (5), it yields:

E(g) ≤ Σ_{k=1}^{Q} E_s^γ(g_k) + Q √( (2/N) ( ln(2/δ) + QP ln(8 √(QP) / γ) ) )

This bound on the error should be compared with the bounds derived from Theorem 1 by substituting for Δ_GH(N) the successive bounds established in Section 3.

5 Discussion and conclusion

This paper analyses two strategies to derive bounds on the generalization error of a multi-class discriminant model. One is based on combinatorial dimensions and the other uses covering numbers. The bounds on the generalization ability derived in the two former sections cannot be readily compared, since they rest on different definitions of the empirical error. A direct application of these results to a biological application [5]
leads to similar bounds in both cases, which are roughly 5%. These bounds are not accurate enough for such an application. In their current implementation, both methods fail to provide useful bounds for real-world applications. However, this study sheds light on some features which could be used to improve the generalization control so as to make it of practical interest. As pointed out in Section 3, the combinatorial approach does not take into account the specificity of the model and should be avoided as it is used here. Indeed, improving the bound on the Natarajan dimension will not decrease the gap between Δ_GH(N) and N_c^H(N), which is one of the main Achilles' heels of the method. To use knowledge of the structure when bounding the generalization error, we have introduced a method based on covering numbers which is new for multi-class discrimination. This method has the disadvantage, however, of crudely bounding E(g) by the sum Σ_{k=1}^{Q} E(g_k). This is the main weakness of the method, and future work should address it. One way to do so is to develop a global approach, directly controlling the generalization error E(g) in terms of covering numbers instead of controlling the individual generalization errors E(g_k). This will be the subject of our next research. Thus, by presenting a new approach for the study of multi-class discriminant models with real internal representations, we have stated new bounds and pointed out a way to improve them. Covering numbers make it possible to use knowledge of the learning system and to include it in the confidence bounds. Further work will be to derive more practical bounds for the model of interest.

References

[1] M. Anthony (1997): Probabilistic Analysis of Learning in Artificial Neural Networks: The PAC Model and its Variants. Neural Computing Surveys, Vol. 1, 1-47.

[2] P. Bartlett (1996): The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network.
Technical report, Department of Systems Engineering, Australian National University, ftp://syseng.anu.edu.au/pub/peter/tr96d.ps.

[3] C.M. Bishop (1995): Neural Networks for Pattern Recognition. Clarendon Press, Oxford.

[4] B. Carl and I. Stephani (1990): Entropy, Compactness, and the Approximation of Operators. Cambridge University Press, Cambridge, UK.

[5] Y. Guermeur, C. Geourjon, P. Gallinari and G. Deléage (1999): Improved Performance in Protein Secondary Structure Prediction by Inhomogeneous Score Combination. To appear in Bioinformatics.

[6] Y. Guermeur, H. Paugam-Moisy and P. Gallinari (1998): Multivariate Linear Regression on Classifier Outputs: a Capacity Study. ICANN'98.

[7] D. Haussler (1992): Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications. Information and Computation, 100.

[8] D. Haussler and P.M. Long (1995): A Generalization of Sauer's Lemma. Journal of Combinatorial Theory, Series A, 71.

[9] B.K. Natarajan (1989): On Learning Sets and Functions. Machine Learning, Vol. 4, 67-97.

[10] D. Pollard (1984): Convergence of Stochastic Processes. Springer Series in Statistics, Springer-Verlag, N.Y.

[11] J. Shawe-Taylor and M. Anthony (1991): Sample sizes for multiple-output threshold networks. Network: Computation in Neural Systems, Vol. 2.

[12] V.N. Vapnik (1982): Estimation of Dependences Based on Empirical Data. Springer-Verlag, N.Y.

[13] V.N. Vapnik (1998): Statistical Learning Theory. John Wiley & Sons, Inc., N.Y.

[14] V.N. Vapnik and A.Y. Chervonenkis (1971): On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, Vol. 16, 264-280.
Introduction to Pattern Recognition [ Part 4 ] Mahdi Vasighi Remarks It is quite common to assume that the data in each class are adequately described by a Gaussian distribution. Bayesian classifier is
More informationMachine Learning Support Vector Machines. Prof. Matteo Matteucci
Machine Learning Support Vector Machines Prof. Matteo Matteucci Discriminative vs. Generative Approaches 2 o Generative approach: we derived the classifier from some generative hypothesis about the way
More informationA characterization of consistency of model weights given partial information in normal linear models
Statistics & Probability Letters ( ) A characterization of consistency of model weights given partial information in normal linear models Hubert Wong a;, Bertrand Clare b;1 a Department of Health Care
More informationPart 1: Overview of the Probably Approximately Correct (PAC) Learning Framework David Haussler Baskin Center for Computer Engine
Part 1: Overview of the Probably Approximately Correct (PAC) Learning Framework David Haussler haussler@cse.ucsc.edu Baskin Center for Computer Engineering and Information Sciences University of California,
More informationSupport Vector Machines vs Multi-Layer. Perceptron in Particle Identication. DIFI, Universita di Genova (I) INFN Sezione di Genova (I) Cambridge (US)
Support Vector Machines vs Multi-Layer Perceptron in Particle Identication N.Barabino 1, M.Pallavicini 2, A.Petrolini 1;2, M.Pontil 3;1, A.Verri 4;3 1 DIFI, Universita di Genova (I) 2 INFN Sezione di Genova
More informationInternational Journal "Information Theories & Applications" Vol.14 /
International Journal "Information Theories & Applications" Vol.4 / 2007 87 or 2) Nˆ t N. That criterion and parameters F, M, N assign method of constructing sample decision function. In order to estimate
More informationy(n) Time Series Data
Recurrent SOM with Local Linear Models in Time Series Prediction Timo Koskela, Markus Varsta, Jukka Heikkonen, and Kimmo Kaski Helsinki University of Technology Laboratory of Computational Engineering
More informationIntroduction to Machine Learning
Introduction to Machine Learning Vapnik Chervonenkis Theory Barnabás Póczos Empirical Risk and True Risk 2 Empirical Risk Shorthand: True risk of f (deterministic): Bayes risk: Let us use the empirical
More informationESANN'1999 proceedings - European Symposium on Artificial Neural Networks Bruges (Belgium), April 1999, D-Facto public., ISBN X, pp.
Statistical mechanics of support vector machines Arnaud Buhot and Mirta B. Gordon Department de Recherche Fondamentale sur la Matiere Condensee CEA-Grenoble, 17 rue des Martyrs, 38054 Grenoble Cedex 9,
More informationCarnegie Mellon University Forbes Ave. Pittsburgh, PA 15213, USA. fmunos, leemon, V (x)ln + max. cost functional [3].
Gradient Descent Approaches to Neural-Net-Based Solutions of the Hamilton-Jacobi-Bellman Equation Remi Munos, Leemon C. Baird and Andrew W. Moore Robotics Institute and Computer Science Department, Carnegie
More informationDoes Unlabeled Data Help?
Does Unlabeled Data Help? Worst-case Analysis of the Sample Complexity of Semi-supervised Learning. Ben-David, Lu and Pal; COLT, 2008. Presentation by Ashish Rastogi Courant Machine Learning Seminar. Outline
More informationIdentication and Control of Nonlinear Systems Using. Neural Network Models: Design and Stability Analysis. Marios M. Polycarpou and Petros A.
Identication and Control of Nonlinear Systems Using Neural Network Models: Design and Stability Analysis by Marios M. Polycarpou and Petros A. Ioannou Report 91-09-01 September 1991 Identication and Control
More informationCoins with arbitrary weights. Abstract. Given a set of m coins out of a collection of coins of k unknown distinct weights, we wish to
Coins with arbitrary weights Noga Alon Dmitry N. Kozlov y Abstract Given a set of m coins out of a collection of coins of k unknown distinct weights, we wish to decide if all the m given coins have the
More informationSpecial Classes of Fuzzy Integer Programming Models with All-Dierent Constraints
Transaction E: Industrial Engineering Vol. 16, No. 1, pp. 1{10 c Sharif University of Technology, June 2009 Special Classes of Fuzzy Integer Programming Models with All-Dierent Constraints Abstract. K.
More informationA Simple Algorithm for Learning Stable Machines
A Simple Algorithm for Learning Stable Machines Savina Andonova and Andre Elisseeff and Theodoros Evgeniou and Massimiliano ontil Abstract. We present an algorithm for learning stable machines which is
More information290 J.M. Carnicer, J.M. Pe~na basis (u 1 ; : : : ; u n ) consisting of minimally supported elements, yet also has a basis (v 1 ; : : : ; v n ) which f
Numer. Math. 67: 289{301 (1994) Numerische Mathematik c Springer-Verlag 1994 Electronic Edition Least supported bases and local linear independence J.M. Carnicer, J.M. Pe~na? Departamento de Matematica
More informationIn Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and. Convergence of Indirect Adaptive. Andrew G.
In Advances in Neural Information Processing Systems 6. J. D. Cowan, G. Tesauro and J. Alspector, (Eds.). Morgan Kaufmann Publishers, San Fancisco, CA. 1994. Convergence of Indirect Adaptive Asynchronous
More informationCurves clustering with approximation of the density of functional random variables
Curves clustering with approximation of the density of functional random variables Julien Jacques and Cristian Preda Laboratoire Paul Painlevé, UMR CNRS 8524, University Lille I, Lille, France INRIA Lille-Nord
More informationVapnik-Chervonenkis Dimension of Neural Nets
P. L. Bartlett and W. Maass: Vapnik-Chervonenkis Dimension of Neural Nets 1 Vapnik-Chervonenkis Dimension of Neural Nets Peter L. Bartlett BIOwulf Technologies and University of California at Berkeley
More information1 Introduction Tasks like voice or face recognition are quite dicult to realize with conventional computer systems, even for the most powerful of them
Information Storage Capacity of Incompletely Connected Associative Memories Holger Bosch Departement de Mathematiques et d'informatique Ecole Normale Superieure de Lyon Lyon, France Franz Kurfess Department
More informationIntroduction to machine learning
1/59 Introduction to machine learning Victor Kitov v.v.kitov@yandex.ru 1/59 Course information Instructor - Victor Vladimirovich Kitov Tasks of the course Structure: Tools lectures, seminars assignements:
More informationQUASI-UNIFORMLY POSITIVE OPERATORS IN KREIN SPACE. Denitizable operators in Krein spaces have spectral properties similar to those
QUASI-UNIFORMLY POSITIVE OPERATORS IN KREIN SPACE BRANKO CURGUS and BRANKO NAJMAN Denitizable operators in Krein spaces have spectral properties similar to those of selfadjoint operators in Hilbert spaces.
More informationNear convexity, metric convexity, and convexity
Near convexity, metric convexity, and convexity Fred Richman Florida Atlantic University Boca Raton, FL 33431 28 February 2005 Abstract It is shown that a subset of a uniformly convex normed space is nearly
More informationSUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION
SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology
More informationLinear Regression and Discrimination
Linear Regression and Discrimination Kernel-based Learning Methods Christian Igel Institut für Neuroinformatik Ruhr-Universität Bochum, Germany http://www.neuroinformatik.rub.de July 16, 2009 Christian
More informationThe information-theoretic value of unlabeled data in semi-supervised learning
The information-theoretic value of unlabeled data in semi-supervised learning Alexander Golovnev Dávid Pál Balázs Szörényi January 5, 09 Abstract We quantify the separation between the numbers of labeled
More informationLEARNING & LINEAR CLASSIFIERS
LEARNING & LINEAR CLASSIFIERS 1/26 J. Matas Czech Technical University, Faculty of Electrical Engineering Department of Cybernetics, Center for Machine Perception 121 35 Praha 2, Karlovo nám. 13, Czech
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationSGN (4 cr) Chapter 5
SGN-41006 (4 cr) Chapter 5 Linear Discriminant Analysis Jussi Tohka & Jari Niemi Department of Signal Processing Tampere University of Technology January 21, 2014 J. Tohka & J. Niemi (TUT-SGN) SGN-41006
More informationThe Complexity and Approximability of Finding. Maximum Feasible Subsystems of Linear Relations. Abstract
The Complexity and Approximability of Finding Maximum Feasible Subsystems of Linear Relations Edoardo Amaldi Department of Mathematics Swiss Federal Institute of Technology CH-1015 Lausanne amaldi@dma.epfl.ch
More informationCSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18
CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H
More informationfor average case complexity 1 randomized reductions, an attempt to derive these notions from (more or less) rst
On the reduction theory for average case complexity 1 Andreas Blass 2 and Yuri Gurevich 3 Abstract. This is an attempt to simplify and justify the notions of deterministic and randomized reductions, an
More information= w 2. w 1. B j. A j. C + j1j2
Local Minima and Plateaus in Multilayer Neural Networks Kenji Fukumizu and Shun-ichi Amari Brain Science Institute, RIKEN Hirosawa 2-, Wako, Saitama 35-098, Japan E-mail: ffuku, amarig@brain.riken.go.jp
More informationEngineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 6: Multi-Layer Perceptrons I
Engineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 6: Multi-Layer Perceptrons I Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 2012 Engineering Part IIB: Module 4F10 Introduction In
More informationSupport Vector Machines
Support Vector Machines Tobias Pohlen Selected Topics in Human Language Technology and Pattern Recognition February 10, 2014 Human Language Technology and Pattern Recognition Lehrstuhl für Informatik 6
More informationFunctional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...
Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................
More informationLearning Kernels -Tutorial Part III: Theoretical Guarantees.
Learning Kernels -Tutorial Part III: Theoretical Guarantees. Corinna Cortes Google Research corinna@google.com Mehryar Mohri Courant Institute & Google Research mohri@cims.nyu.edu Afshin Rostami UC Berkeley
More informationVector Space Basics. 1 Abstract Vector Spaces. 1. (commutativity of vector addition) u + v = v + u. 2. (associativity of vector addition)
Vector Space Basics (Remark: these notes are highly formal and may be a useful reference to some students however I am also posting Ray Heitmann's notes to Canvas for students interested in a direct computational
More information12 CHAPTER 1. PRELIMINARIES Lemma 1.3 (Cauchy-Schwarz inequality) Let (; ) be an inner product in < n. Then for all x; y 2 < n we have j(x; y)j (x; x)
1.4. INNER PRODUCTS,VECTOR NORMS, AND MATRIX NORMS 11 The estimate ^ is unbiased, but E(^ 2 ) = n?1 n 2 and is thus biased. An unbiased estimate is ^ 2 = 1 (x i? ^) 2 : n? 1 In x?? we show that the linear
More informationPointwise convergence rate for nonlinear conservation. Eitan Tadmor and Tao Tang
Pointwise convergence rate for nonlinear conservation laws Eitan Tadmor and Tao Tang Abstract. We introduce a new method to obtain pointwise error estimates for vanishing viscosity and nite dierence approximations
More informationLogistic Regression. COMP 527 Danushka Bollegala
Logistic Regression COMP 527 Danushka Bollegala Binary Classification Given an instance x we must classify it to either positive (1) or negative (0) class We can use {1,-1} instead of {1,0} but we will
More information> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel
Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation
More informationMachine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015
Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and
More informationVapnik-Chervonenkis Dimension of Neural Nets
Vapnik-Chervonenkis Dimension of Neural Nets Peter L. Bartlett BIOwulf Technologies and University of California at Berkeley Department of Statistics 367 Evans Hall, CA 94720-3860, USA bartlett@stat.berkeley.edu
More information