Estimating the sample complexity of a multi-class discriminant model

Yann Guermeur
LIP6, UMR CNRS 7606, Université Paris 6, 4 place Jussieu, 75252 Paris cedex 05
Yann.Guermeur@lip6.fr

André Elisseeff and Hélène Paugam-Moisy
ERIC, Université Lumière Lyon 2, 5 avenue Pierre Mendès-France, 69676 Bron cedex
{aelissee,hpaugam}@univ-lyon2.fr

Abstract

We study the generalization performance of a multi-class discriminant model. Several bounds on its sample complexity are derived from uniform convergence results based on different measures of capacity. This gives us an insight into the nature of the capacity measure which is best suited to the study of multi-class discriminant models.

1 Introduction

Since the pioneering work of Vapnik and Chervonenkis [14], extending the classical Glivenko-Cantelli theorem to give a uniform convergence result over classes of indicator functions, many studies have dealt with uniform strong laws of large numbers (see for instance [10, 7]). The bounds they provide, both in pattern recognition and regression estimation, can readily be used to derive a sample complexity which is an increasing function of a particular measure of the capacity of the model considered. The choice of the appropriate bound, as well as the computation of a tight upper bound on the capacity measure, thus appear to be of central importance to derive a tight bound on the sample complexity. This is the subject we investigate in this paper, through an application to the Multivariate Linear Regression (MLR) combiner described in [6]. Up to now, very few studies in statistical learning theory have dealt with multi-class discrimination. Section 2 briefly outlines the implementation of the MLR model for the combination of class posterior probability estimates. Section 3 is devoted to the estimation of the sample complexity using two combinatorial quantities generalizing the Vapnik-Chervonenkis (VC) dimension, called the graph dimension and the Natarajan dimension. In section 4, another bound is derived from a theorem where the capacity of the family of functions is characterized by its covering number. Both bounds are discussed in section 5, where further improvements are proposed as perspectives.

2 MLR combiner for classifier combination

We consider a Q-category discrimination task, under the usual hypothesis that there is a joint distribution, fixed but unknown, on S = X × Y, where X is the input space and Y the set of categories. We further assume that, for each input pattern x ∈ X, the outputs of P classifiers are available. Let f_j denote the function computed by the j-th of these classifiers: f_j(x) = [f_{jk}(x)] ∈ R^Q. The k-th output f_{jk}(x) approximates the class posterior probability P(C_k | x). Precisely, f_j(x) ∈ U with U = {u ∈ R_+^Q : 1_Q^T u = 1}. In other words, the outputs are non-negative and sum to 1. Let F(x) = [f_j(x)], 1 ≤ j ≤ P, with F(x) ∈ U^P, be the vector of predictors.
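The following minimal sketch (my own illustration, not code from the paper; the helper names and the renormalisation step are assumptions) shows how such an input representation can be built: the P probability vectors f_j(x), each lying in the simplex U, are stacked into a single vector F(x) ∈ U^P ⊂ R^{PQ}.

```python
import numpy as np

def project_to_simplex(u, eps=1e-12):
    """Clip to non-negative values and renormalise so that the components sum to 1."""
    u = np.clip(np.asarray(u, dtype=float), 0.0, None)
    s = u.sum()
    return u / s if s > eps else np.full_like(u, 1.0 / len(u))

def build_F(classifier_outputs):
    """Stack P probability vectors f_j(x) in U (each of size Q) into F(x) in U^P."""
    F = np.concatenate([project_to_simplex(f) for f in classifier_outputs])
    # As noted in the text, every vector of U^P has Euclidean norm at most sqrt(P).
    assert np.linalg.norm(F) <= np.sqrt(len(classifier_outputs)) + 1e-9
    return F

# Example: P = 3 classifiers, Q = 4 categories, so F(x) lives in R^12.
outputs = [np.array([0.70, 0.10, 0.10, 0.10]),
           np.array([0.25, 0.25, 0.25, 0.25]),
           np.array([0.05, 0.05, 0.10, 0.80])]
F_x = build_F(outputs)
```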

The MLR model studied here, parameterized by v = [v_k] with v_k ∈ R^{PQ}, computes the functions g ∈ G given by:

$$
g(x) = \begin{pmatrix} g_1(x) \\ \vdots \\ g_k(x) \\ \vdots \\ g_Q(x) \end{pmatrix}
     = \begin{pmatrix} v_1^T \\ \vdots \\ v_k^T \\ \vdots \\ v_Q^T \end{pmatrix} F(x)
$$

Let C be the set of loss functions satisfying the general conditions under which outputs can be interpreted as probabilities (see for instance [3]), and let s be a N-sample of observations (s ∈ S^N). Among the functions of G, the MLR combiner is any of the functions which constitutes a solution to the following optimization problem:

Problem 1 Given a convex loss function L ∈ C and a N-sample s ∈ S^N, find a function in G minimizing the empirical risk Ĵ(v) and taking its values in U.

The MLR combiner is thus designed to take as inputs class posterior probability estimates and to output better estimates with respect to some given criterion (least squares, cross-entropy, ...). v_{k,l,m}, the general term of v_k, is the coefficient associated with the predictor f_{lm}(x) in the regression computed to estimate P(C_k | x). Let v be the vector of all the parameters (v = [v_k] ∈ R^{PQ^2}). As was pointed out in [6], optimal solutions to Problem 1 are obtained by minimizing Ĵ(v) subject to v ∈ V, where V is the convex subset of the non-negative orthant of R^{PQ^2} defined in [6] by linear equality constraints tying the coefficients v_{k,l,m} across categories and normalizing their global sum. This result holds irrespective of the choice of L and s. Details on the computation of a global minimum can be found in [5]. To sum up, the model we study is a multiple-output perceptron, under the constraints ∀x ∈ X, F(x) ∈ U^P and v ∈ V. One can easily verify that the Euclidean norm ‖F(x)‖ of every vector in U^P is bounded above by √P. Furthermore, solving a simple quadratic programming problem establishes that for all v ∈ V and k ∈ {1, ..., Q}, ‖v_k‖ ≤ √Q.
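As a concrete illustration of the model (my own sketch, not the authors' implementation), the code below computes g(x) = [v_k^T F(x)] together with the arg-max decision used in the following sections (Bayes' estimated decision rule). The rescaling enforcing ‖v_k‖ ≤ √Q is only a crude stand-in for the actual constraint set V and the quadratic programming solution of Problem 1.

```python
import numpy as np

def mlr_combine(V, F_x):
    """V has shape (Q, P*Q); row k holds v_k, so g_k(x) = v_k . F(x)."""
    return V @ F_x

def bayes_decision(g_x):
    """Bayes' estimated decision rule: predict the category with the largest estimate."""
    return int(np.argmax(g_x))

# Example with Q = 4 categories and P = 3 classifiers (F(x) in R^12).
rng = np.random.default_rng(0)
Q, P = 4, 3
V = rng.uniform(0.0, 1.0, size=(Q, P * Q))
# Crude stand-in for the constraints defining V: non-negative coefficients,
# with each row rescaled so that ||v_k|| <= sqrt(Q), as used in the capacity bounds.
V *= np.minimum(1.0, np.sqrt(Q) / np.linalg.norm(V, axis=1, keepdims=True))
F_x = np.full(P * Q, 1.0 / Q)          # a (uniform) point of U^P
g_x = mlr_combine(V, F_x)
print(g_x, bayes_decision(g_x))
```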

3 Growth function bounds

From now on, we consider the MLR combiner as a discriminant model, by application of Bayes' estimated decision rule. Although the learning process described in the previous section only amounts to estimating probability densities on a finite sample, the criterion of interest, the generalization ability in terms of recognition rate, can be both estimated and controlled. To perform this task, we use results derived in the framework of computational learning theory and the statistical learning theory introduced by Vapnik. Roughly speaking, they represent corollaries of uniform strong laws of large numbers. The theorems used in this section are derived from bounds grounded on a combinatorial capacity measure called the growth function. In order to express them, we must first introduce additional notations and definitions. Let H be a family of functions from X to a finite set Y. E_s(h) and E(h) respectively designate the error on s (observed error) and the generalization error of a function h belonging to H.

Definition 1 Let H be a set of indicator functions (values in {0, 1}). Let s_X be a N-sample of X and Δ_H(s_X) the number of different classifications of s_X by the functions of H. The growth function Δ_H is defined by:

$$\Delta_H(N) = \max\left\{ \Delta_H(s_X) : s_X \in X^N \right\}$$

Definition 2 The VC dimension of a set H of indicator functions is the maximum number d of vectors that can be shattered, i.e. separated into two classes in all 2^d possible ways, using functions of H. If this maximum does not exist, the VC dimension is equal to infinity.

These definitions only apply to sets of indicator functions. The following theorem, dealing with multiple-output functions, appears as an immediate corollary of a result due to Vapnik [12] (see for instance [1]):

Theorem 1 Let s ∈ S^N. With probability 1 − δ,

$$E(h) < E_s(h) + \sqrt{\frac{1}{N}\left(\ln \Delta_{GH}(N) - \ln \delta\right)} + \frac{1}{N}$$

GH, the graph space of H, is defined as follows: for h ∈ H, let Gh be the function from X × Y to {0, 1} defined by Gh(x, y) = 1 if and only if h(x) = y, and GH = {Gh : h ∈ H}. Δ_GH is the growth function of GH [9]. Thus, in order to obtain an upper bound on the confidence interval in Theorem 1, one must find an upper bound on the growth function associated with the MLR combiner. This constitutes the object of the next two subsections, where two different bounds are stated.

3.1 Graph dimension of the MLR combiner

The discriminant functions computed by the MLR combiner are also computed by the single hidden layer perceptron with threshold activation functions depicted in Figure 1. Several articles have been devoted to bounding the growth function and VC dimension of multilayer perceptrons [9, 11]. Proceeding as in [11], one can use as upper bound on the growth function the product of the bounds on the growth functions of the Q(Q−1)/2 individual hidden units derived from Sauer's lemma. Since the VC dimension of each hidden unit is at most equal to the dimension of the smallest subspace of R^{PQ} containing U^P, and this dimension is d_E = P(Q−1) + 1 [6], we get:

$$\Delta_{GMLP}(N) < \left(\frac{eN}{d_E}\right)^{\frac{Q(Q-1)}{2}\, d_E} \qquad (1)$$

[Figure 1: Architecture of a multi-layer perceptron with threshold units computing the same discriminant function as the MLR combiner. The input layer carries F(x) (PQ units), the hidden layer contains the units associated with the pairs (v_i, v_j), 1 ≤ i < j ≤ Q, and the output layer implements Bayes' estimated decision rule. Each hidden unit computes a function h_{i,j}(F(x)) = t(F(x)^T (v_i − v_j)), where t(z) = 1 if z > 0 and t(z) = −1 otherwise. The weights of the output layer, either +1 (solid lines) or −1 (dashed lines), and the biases are chosen so that the output units compute a logical AND. The number below each layer corresponds to the number of units.]

This bound can be significantly improved by making use of the dependences between the hidden units. For lack of place, we only exhibit here a simple way to do so. Let x ∈ s_X. Among all the possible classifications performed by the hidden unit whose vector is v_1 − v_2, exactly half of them associate to F(x) the value 1. The same is true for the hidden unit whose vector is v_2 − v_3. Now, if these two units provide the same output for F(x), then the output of the hidden unit whose vector is v_1 − v_3 is known. Consequently, the growth function associated with these three units is at most (eN/d_E)^{2 d_E}. Proceeding step by step, we thus get an improved bound:

$$\Delta_{GMLP}(N) < \left(\frac{eN}{d_E}\right)^{(Q-1)\, d_E} \qquad (2)$$
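To give a feel for the size of these combinatorial bounds, here is a small numerical sketch (my own illustration, relying on the reconstruction of bound (1) and of Theorem 1 given above, not on code from the paper); it evaluates the logarithm of (1) and the resulting confidence term:

```python
import math

def ln_growth_bound(N, P, Q):
    """ln of bound (1): (eN/d_E)^(d_E * Q(Q-1)/2), with d_E = P(Q-1) + 1."""
    d_E = P * (Q - 1) + 1
    return (Q * (Q - 1) / 2) * d_E * math.log(math.e * N / d_E)

def theorem1_interval(N, P, Q, delta):
    """Confidence term of Theorem 1: sqrt((ln Delta_GH(N) - ln delta) / N) + 1/N."""
    return math.sqrt((ln_growth_bound(N, P, Q) - math.log(delta)) / N) + 1.0 / N

# Example: P = 3 combined classifiers, Q = 4 categories, delta = 0.05.
for N in (10**3, 10**4, 10**5, 10**6):
    print(N, round(theorem1_interval(N, P=3, Q=4, delta=0.05), 3))
```

Even for these small values of P and Q, the term only becomes informative for samples of the order of 10^4 to 10^5 examples, and it grows quickly with P and Q, in line with the discussion of section 5.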

3.2 Natarajan dimension

Natarajan introduced in [9] an extension of the VC dimension for multiple-output functions, based on the following definition of shattering:

Definition 3 A set H of discrete-valued functions shatters a set s_X of vectors if and only if there exist two functions h_1 and h_2 belonging to H such that: (a) for any x ∈ s_X, h_1(x) ≠ h_2(x); (b) for all s_1 ⊆ s_X, there exists h_3 ∈ H such that h_3 agrees with h_1 on s_1 and with h_2 on s_X \ s_1, i.e. ∀x ∈ s_1, h_3(x) = h_1(x), and ∀x ∈ s_X \ s_1, h_3(x) = h_2(x).

We established the following result in [6]:

Theorem 2 The Natarajan dimension d_N of the MLR combiner satisfies:

$$P + \left\lfloor \frac{Q}{2} \right\rfloor \;\le\; d_N \;\le\; Q\,(d_E - 1)$$

In [8], the following theorem was proved:

Theorem 3 Let N^c_H(N) be the number of different classifications performed by a set of functions H on a set of size N. Let d_N be the Natarajan dimension of H. Then, for all d ≥ d_N:

$$N^c_H(N) \le \sum_{i=0}^{d} \binom{N}{i} \left(\frac{Q(Q-1)}{2}\right)^i$$

We have the following inequality:

$$\sum_{i=0}^{d} \binom{N}{i} \left(\frac{Q(Q-1)}{2}\right)^i \le \left(\frac{eN\,Q(Q-1)}{2d}\right)^d \qquad (3)$$

Obviously, Δ_GH(N) ≤ N^c_H(N). Thus, substituting the upper bound on d_N provided by Theorem 2 for d in the formula of Theorem 3 and applying (3), we get:

$$\Delta_{GMLP}(N) < \left(\frac{eN\,(Q-1)}{2\,(d_E - 1)}\right)^{Q(d_E-1)} \qquad (4)$$

This last bound is very similar to those provided by (1) and (2). This is a good indication that these bounds are quite loose, since bounding Δ_GH(N) by N^c_H(N) is crude. In fact, the difficulty with the use of the growth function lies in the way to take into account the specificity of the model (the links between the hidden units) when applying a generalization of Sauer's lemma. This difficulty concerns all the approaches using combinatorial methods which do not consider how the model is built. This led us to consider an alternative definition of the capacity of the combiner, which provides an original way to obtain confidence bounds for multi-class discriminant models based on the computation of the a posteriori probabilities P(C_k | x).
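The sketch below (my own illustration, assuming the reconstructed forms of Theorem 3 and of inequality (3) given above) compares the combinatorial sum with its closed-form relaxation for the value d = Q(d_E − 1) provided by Theorem 2:

```python
import math
from math import comb, log

def ln_sum_bound(N, Q, d):
    """ln of sum_{i=0}^{d} C(N, i) * (Q(Q-1)/2)^i, the bound of Theorem 3."""
    pairs = Q * (Q - 1) // 2
    return log(sum(comb(N, i) * pairs**i for i in range(d + 1)))

def ln_closed_form(N, Q, d):
    """ln of (e N Q(Q-1) / (2d))^d, the relaxation used in (3) and (4)."""
    return d * log(math.e * N * Q * (Q - 1) / (2 * d))

# Example: P = 3, Q = 4, hence d_E = P(Q-1) + 1 = 10 and d = Q(d_E - 1) = 36.
P, Q = 3, 4
d_E = P * (Q - 1) + 1
d = Q * (d_E - 1)
for N in (100, 1000):
    print(N, round(ln_sum_bound(N, Q, d), 1), "<=", round(ln_closed_form(N, Q, d), 1))
```

For these values the two logarithms are within a few percent of each other, so the closed-form relaxation (3) is not the main source of looseness; the crude step Δ_GH(N) ≤ N^c_H(N) is.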

4 Covering number bounds

Many studies have been devoted to giving rates of uniform convergence of means to their expectations based on a capacity measure called the covering number [12, 10, 7]. To define this measure, we first introduce the notion of ε-cover in a pseudo-metric space.

Definition 4 Let (E, ρ) be a pseudo-metric space. A finite set T ⊆ E is an ε-cover of a set H ⊆ E with respect to ρ if for all h ∈ H there is a h̄ ∈ T such that ρ(h, h̄) ≤ ε.

Definition 5 Given ε and ρ, the covering number N(ε, H, ρ) of H is the size of the smallest ε-cover of H.

Covering numbers were introduced in learning theory to derive bounds for regression estimation problems. In [2], Bartlett has extended their use to discrimination, specifically to dichotomy computation. From now on, H is supposed to be a set of functions taking their values in R. The discriminant function associated with any of the functions h ∈ H is the function t ∘ h, where t is the sign function defined in the caption of Figure 1. E(h) is defined accordingly.

Definition 6 Let π_γ : R → [−γ, γ] be the piecewise-linear function such that:

$$\pi_\gamma(x) = \begin{cases} \gamma\, t(x) & \text{if } |x| \ge \gamma \\ x & \text{otherwise} \end{cases}$$

Let us define π_γ(H) = {π_γ(h) : h ∈ H} and N_∞(ε, H, N) = max_{s_X} N(ε, H, ‖·‖_∞^{s_X}), where ‖h‖_∞^{s_X} = max_{1≤i≤N} |h(x_i)|. Let the empirical error according to the margin γ be defined as follows:

Definition 7

$$E_s^\gamma(h) = \frac{1}{N}\,\bigl|\{(x_i, y_i) \in s : h(x_i)\, y_i < \gamma\}\bigr|$$

We have:

Theorem 4 Suppose γ > 0 and 0 < δ < 1/2. Then, with probability at least 1 − δ,

$$E(h) \le E_s^\gamma(h) + \sqrt{\frac{2}{N}\left(\ln N_\infty(\gamma/2,\, \pi_\gamma(H),\, 2N) + \ln\frac{2}{\delta}\right)}$$

The MLR model takes its values in [0, 1]^Q. For each pattern x, the desired output y is the canonical coding of the category of x. If |g_k(x) − y_k| < 1/2 for all k, then the use of Bayes' estimated decision rule will provide the correct classification for x. A contrario, an error can occur only if |g_k(x) − y_k| ≥ 1/2 for at least one k. For all k, we define E(g_k) as being the error for g_k − 1/2, and E(g) as the generalization error of the discriminant function associated with g. Then, we have:

$$E(g) \le \sum_{k=1}^{Q} E(g_k) \qquad (5)$$

As in the previous section, bounding the generalization ability of the MLR combiner amounts to bounding a measure of complexity. This time, however, this measure is defined for each individual function g_k. The end of this section thus deals with stating bounds on N_∞(γ/2, π_γ(H), N), where H equals {g_k − 1/2 : g ∈ G} for any k in {1, ..., Q}. Since the function π_γ is 1-Lipschitz,

$$N_\infty(\gamma/2,\, \pi_\gamma(H),\, N) \le N_\infty(\gamma/2,\, H,\, N)$$

Let T be the affine application defined over B_{PQ} (the unit ball of R^{PQ} endowed with ‖·‖_2) as:

$$T : B_{PQ} \to \mathbb{R}^N, \qquad w \mapsto \sqrt{Q}\,\bigl[F(x_1)^T w, \ldots, F(x_N)^T w\bigr]^T - \tfrac{1}{2}\,\mathbf{1}_N$$

It has the property that {[F(x_1)^T w − 1/2, ..., F(x_N)^T w − 1/2]^T : ‖w‖ ≤ √Q} ⊆ T(B_{PQ}). Hence, since ‖v_k‖ ≤ √Q,

$$N_\infty(\gamma/2,\, H,\, N) \le \max_{s_X \in X^N} N\bigl(\gamma/2,\, T(B_{PQ}),\, \|\cdot\|_\infty\bigr) \qquad (6)$$

According to [4], we have

$$N\bigl(\gamma/2,\, T(B_{PQ}),\, \|\cdot\|_\infty\bigr) \le \left(\frac{8\,\|\tilde{T}\|}{\gamma}\right)^{PQ} \qquad (7)$$

where T̃ is the linear operator associated with T and ‖T̃‖ = sup_{w ∈ B_{PQ}} ‖T̃w‖_∞ / ‖w‖. Since ‖F(x)‖ ≤ √P, the Cauchy-Schwarz inequality shows that ‖T̃‖ is bounded above by √(PQ). By injection of (7) in (6) and by applying Theorem 4 to bound the right-hand side of (5), it yields:

$$E(g) \le \sum_{k=1}^{Q} E_s^\gamma(g_k) + Q\,\sqrt{\frac{2}{N}\left(\ln\frac{2Q}{\delta} + PQ\,\ln\frac{8\sqrt{PQ}}{\gamma}\right)}$$

This bound on the error should be related to the bounds derived from Theorem 1 by substituting for Δ_GH(N) the successive bounds on the growth function established in section 3.
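For comparison with the combinatorial route, the sketch below (my own code, based on the reconstruction of the final bound above, with the margin set to γ = 1/2 as in the argument preceding (5)) evaluates the covering-number confidence term for a few sample sizes:

```python
import math

def covering_bound_term(N, P, Q, delta, gamma=0.5):
    """Confidence term added to sum_k E_s^gamma(g_k):
    Q * sqrt((2/N) * (ln(2Q/delta) + P*Q*ln(8*sqrt(P*Q)/gamma)))."""
    ln_cover = P * Q * math.log(8.0 * math.sqrt(P * Q) / gamma)
    return Q * math.sqrt((2.0 / N) * (math.log(2.0 * Q / delta) + ln_cover))

# Example: as before, P = 3 classifiers, Q = 4 categories, delta = 0.05.
for N in (10**3, 10**4, 10**5, 10**6):
    print(N, round(covering_bound_term(N, P=3, Q=4, delta=0.05), 3))
```

As with the combinatorial bounds, this term only becomes small for very large samples, which is consistent with the observation of section 5 that neither method currently yields bounds of practical interest.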

5 Discussion and Conclusion

This paper analyses two strategies to derive bounds on the generalization error of a multi-class discriminant model. One is based on combinatorial dimensions and the other uses covering numbers. The bounds on the generalization ability derived in the two former sections cannot be readily compared, since they rest on different definitions of the empirical error. A direct application of these results to a biological application [5] leads to similar bounds in both cases; these bounds are not accurate for such an application. In their current implementation, both methods fail to provide useful bounds for real-world applications. However, this study sheds light on some features which could be used to improve the generalization control so as to make it of practical interest. As pointed out in section 3, the combinatorial approach does not take into account the specificity of the model and should be avoided as it is used here. Improving the estimate of the Natarajan dimension will indeed not decrease the gap between Δ_GH(N) and N^c_H(N), which is one of the most important Achilles' heels of the method. To use knowledge of the structure of the model when bounding the generalization error, we introduced a method based on covering numbers which is new for multi-class discrimination. This method, however, has the disadvantage of crudely bounding E(g) by the sum Σ_{k=1}^{Q} E(g_k). This is the main weakness of the method, and future work should fix it. One way to do so is to develop a global approach by directly controlling the generalization error E(g) in terms of covering numbers, instead of controlling the individual generalization errors E(g_k). This will be the subject of our next research. Thus, by presenting a new approach for the study of multi-class discriminant models with real-valued internal representations, we have stated new bounds and pointed out a way to improve them. Covering numbers make it possible to use knowledge of the learning system and to include it in the confidence bounds. Further work will be to derive more practical bounds for the model of interest.

References

[1] M. Anthony (1997): Probabilistic Analysis of Learning in Artificial Neural Networks: The PAC Model and its Variants. Neural Computing Surveys, Vol. 1, 1-47.
[2] P. Bartlett (1996): The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. Technical report, Department of Systems Engineering, Australian National University, ftp: syseng.anu.edu.au:pub/peter/tr96d.ps.
[3] C.M. Bishop (1995): Neural Networks for Pattern Recognition. Clarendon Press, Oxford.
[4] B. Carl and I. Stephani (1990): Entropy, Compactness, and the Approximation of Operators. Cambridge University Press, Cambridge, UK.
[5] Y. Guermeur, C. Geourjon, P. Gallinari and G. Deléage (1999): Improved Performance in Protein Secondary Structure Prediction by Inhomogeneous Score Combination. To appear in Bioinformatics.
[6] Y. Guermeur, H. Paugam-Moisy and P. Gallinari (1998): Multivariate Linear Regression on Classifier Outputs: a Capacity Study. ICANN'98.
[7] D. Haussler (1992): Decision Theoretic Generalizations of the PAC Model for Neural Net and Other Learning Applications. Information and Computation, 100.
[8] D. Haussler and P.M. Long (1995): A Generalization of Sauer's Lemma. Journal of Combinatorial Theory, Series A, 71.
[9] B.K. Natarajan (1989): On Learning Sets and Functions. Machine Learning, Vol. 4, 67-97.
[10] D. Pollard (1984): Convergence of Stochastic Processes. Springer Series in Statistics, Springer-Verlag, N.Y.
[11] J. Shawe-Taylor and M. Anthony (1991): Sample sizes for multiple-output threshold networks. Network: Computation in Neural Systems, Vol. 2.
[12] V.N. Vapnik (1982): Estimation of Dependences Based on Empirical Data. Springer-Verlag, N.Y.
[13] V.N. Vapnik (1998): Statistical Learning Theory. John Wiley & Sons, Inc., N.Y.
[14] V.N. Vapnik and A.Y. Chervonenkis (1971): On the Uniform Convergence of Relative Frequencies of Events to their Probabilities. Theory of Probability and its Applications, Vol. 16, 264-280.
