On the Value of Partial Information for Learning from Examples


JOURNAL OF COMPLEXITY 13 (1998), ARTICLE NO. CM

On the Value of Partial Information for Learning from Examples

Joel Ratsaby,* Department of Electrical Engineering, Technion, Haifa, Israel, and Vitaly Maiorov, Department of Mathematics, Technion, Haifa, Israel

Received August 15, 1996

The PAC model of learning and its extension to real-valued function classes provides a well-accepted theoretical framework for representing the problem of learning a target function g(x) using a random sample {(x_i, g(x_i))}_{i=1}^m. Based on the uniform strong law of large numbers, the PAC model establishes the sample complexity, i.e., the sample size m which is sufficient for accurately estimating the target function to within high confidence. Often, in addition to a random sample, some form of prior knowledge is available about the target. It is intuitive that increasing the amount of information should have the same effect on the error as increasing the sample size. But quantitatively, how does the rate of error with respect to increasing information compare to the rate of error with increasing sample size? To answer this we consider a new approach based on a combination of the information-based complexity of Traub et al. and Vapnik–Chervonenkis (VC) theory. In contrast to VC-theory, where function classes of finite pseudo-dimension are used only for statistical-based estimation, we let such classes play a dual role of functional estimation as well as approximation. This is captured in a newly introduced quantity, ρ_d(F), which represents a nonlinear width of a function class F. We then extend the notion of the nth minimal radius of information and define a quantity I_{n,d}(F) which measures the minimal approximation error of the worst-case target g ∈ F by the family of function classes having pseudo-dimension d, given partial information on g consisting of values taken by n linear operators. The error rates are calculated, which leads to a quantitative notion of the value of partial information for the paradigm of learning from examples. © 1998 Academic Press

* jer@ee.technion.ac.il. All correspondence should be mailed to this author. maiorov@tx.technion.ac.il

Copyright 1998 by Academic Press. All rights of reproduction in any form reserved.

1. INTRODUCTION

The problem of machine learning using randomly drawn examples has received in recent years a significant amount of attention while serving as the basis of research in what is known as the field of computational learning theory. Valiant [35] introduced a learning model based on which many interesting theoretical results pertaining to a variety of learning paradigms have been established. The theory is based on the pioneering work of Vapnik and Chervonenkis [36–38] on finite sample convergence rates of the uniform strong law of large numbers (SLN) over classes of functions. In its basic form it sets a framework known as the probably approximately correct (PAC) learning model. In this model an abstract teacher provides the learner with a finite number m of i.i.d. examples {(x_i, g(x_i))}_{i=1}^m randomly drawn according to an unknown underlying distribution P over X, where g is the target function to be learnt to some prespecified arbitrary accuracy ε > 0 (with respect to the L_1(P)-norm) and confidence 1 − δ, where δ > 0. The learner has at his discretion a functional class referred to as the hypothesis class from which he is to determine a function ĥ, sample-dependent, which estimates the unknown target g to within the prespecified accuracy and confidence levels. There have been numerous studies and applications of this learning framework to different learning problems (Kearns and Vazirani [18], Hanson et al. [15]). The two main variables of interest in this framework are the sample complexity, which is the sample size sufficient for guaranteeing the prespecified performance, and the computational complexity of the method used to produce the estimator hypothesis ĥ.

The bulk of the work in computational learning theory and, similarly, in the classical field of pattern recognition, treats the scenario in which the learner has access only to randomly drawn samples. It is often the case, however, that some additional knowledge about the target is available through some form of a priori constraints on the target function g. In many areas where machine learning may be applied there is a source of information, sometimes referred to as an oracle or an expert, which supplies random examples and even more complex forms of partial information about the target. A few instances of such learning problems include: (1) Pattern classification. Credit card fraud detection, where a tree classifier (Devroye et al. [12]) is built from a training sample consisting of patterns of credit card usage in order to learn to detect transactions that are potentially fraudulent. Partial information may be represented by an existing tree which is based on human-expert knowledge. (2) Prediction and financial analysis. Financial forecasting and portfolio management, where an artificial neural network learns from time-series data and is given rule-based partial knowledge translated into constraints on the weights of the neuron elements. (3) Control and optimization. Learning a control process for industrial manufacturing, where partial information represents quantitative physical constraints on the various machines and their operation.

For some specific learning problems the theory predicts that partial knowledge is very significant. For instance, in statistical pattern classification or in density estimation, having some knowledge about the underlying probability distributions may crucially influence the complexity of the learning problem (cf. Devroye [11]). If the distributions are known to be of a certain parametric form, an exponentially large savings in sample size may be obtained (cf. Ratsaby [28], Ratsaby and Venkatesh [30, 31]). In general, partial information may appear as knowledge about certain properties of the target function. In parametric-based estimation or prediction problems, e.g., maximum likelihood estimation, knowledge concerning the unknown target may appear in terms of a geometric constraint on the Euclidean subset that contains the true unknown parameter. In problems of pattern recognition and statistical regression estimation, often some form of a criterion functional over the hypothesis space is defined. For instance, in artificial neural networks, the widely used back-propagation algorithm (cf. Ripley [32]) implements a least-square-error criterion defined over a finite-dimensional manifold spanned by ridge functions of the form σ(a^T x + b), where σ(y) = 1/(1 + e^{−y}). Here prior knowledge can take the form of a constraint added on to the minimization of the criterion. In Section 3 we provide further examples where partial information is used in practice.

It is intuitive that general forms of prior partial knowledge about the target and random sample data are both useful. PAC provides the complexity of learning in terms of the sample sizes that are sufficient to obtain accurate estimation of g. Our motive in this paper is to study the complexity of learning from examples while being given prior partial information about the target. We seek the value of partial information in the PAC learning paradigm. The approach taken here is based on combining frameworks of two fields in computer science, the first being information-based complexity (cf. Traub et al. [34]), which provides a representation of partial information, while the second, computational learning theory, furnishes the framework for learning from random samples.

The remainder of this paper is organized as follows: In Section 2 we briefly review the PAC learning model and Vapnik–Chervonenkis theory. In Section 3 we provide motivation for the work. In Section 4 we introduce a new approximation width which measures the degree of nonlinear approximation of a functional class. It joins elementary concepts from Vapnik–Chervonenkis theory and classical approximation theory. In Section 5 we briefly review some of the definitions of information-based complexity and then introduce the minimal information-error I_{n,d}(·). In Section 6 we combine the PAC learning error with the minimal partial information error to obtain a unified upper bound on the error. In Section 7 we compute this upper bound for the case of learning a Sobolev target class. This yields a quantitative trade-off between partial information and sample size. We then compute a lower bound on the minimal partial information error for the Sobolev class which yields an almost optimal information operator. The Appendix includes the proofs of all theorems in the paper.

2. OVERVIEW OF THE PROBABLY APPROXIMATELY CORRECT LEARNING MODEL

Valiant [35] introduced a new complexity-based model of learning from examples and illustrated this model for problems of learning indicator functions over the boolean cube {0, 1}^n. The model is based on a probabilistic framework which has become known as the probably approximately correct, or PAC, model of learning. Blumer et al. [6] extended this basic PAC model to learning indicator functions of sets in Euclidean R^n. Their methods are based on the pioneering work of Vapnik and Chervonenkis [36] on finite sample convergence rates of empirical probability estimates, independent of the underlying probability distribution. Haussler [16] has further extended the PAC model to real- and vector-valued functions, which is applicable to general statistical regression, density estimation and classification learning problems. We start with a description of the basic PAC model and some of the relevant results concerning the complexity of learning.

A target class F is a class of Borel measurable functions over a domain X containing a target function g which is to be learnt from a sample z^m = {(x_i, g(x_i))}_{i=1}^m of m examples that are randomly drawn i.i.d. according to any fixed probability distribution P on X. Define by S the sample space for F, which is the set of all samples of size m over all functions f ∈ F, for all m ≥ 1. Fix a hypothesis class H of functions on X which need not be equal to nor contained in F. A learning algorithm φ: S → H is a function that, given a large enough randomly drawn sample of any target in F, returns a Borel measurable function h ∈ H (a hypothesis) which is with high probability a good approximation of the target function g. Associated with each hypothesis h is a nonnegative error value L(h), which measures its disagreement with the target function g on a randomly drawn example, and an empirical error L_m(h), which measures the disagreement of h with g averaged over the observed m examples. Note that the notation L(h) and L_m(h) leaves the dependence on g and P implicit. For the special case of F and H being classes of indicator functions over sets of X = R^n, the error of a hypothesis h is defined to be the probability (according to P) of its symmetric difference with the target g; i.e.,

L(h) = P({x ∈ R^n : g(x) ≠ h(x)}).   (1)

Correspondingly, the empirical error of h is defined as

L_m(h) = (1/m) Σ_{i=1}^m 1_{g(x_i) ≠ h(x_i)},   (2)

where 1_{x ∈ A} stands for the indicator function of the set A. For real-valued function classes F and H, the error of a hypothesis h is taken as the expectation E l(h, g) (with respect to P) of some positive real-valued loss function l(h, g), e.g., the quadratic loss l(h, g) = (h(x) − g(x))^2 in regression estimation, or the log likelihood loss l(h, g) = ln(g(x)/h(x)) for density estimation. Similarly, the empirical error now becomes the average loss over the sample, i.e., L_m(h) = (1/m) Σ_{i=1}^m l(h(x_i), g(x_i)). We now state a formal definition of a learning algorithm which is an extension of a definition in Blumer et al. [6].

DEFINITION 1 (PAC-learning algorithm). Fix a target class F, a hypothesis class H, a loss function l(·, ·), and any probability distribution P on X. Denote by P^m the m-fold joint probability distribution on X^m. A function φ is a learning algorithm for F with respect to P with sample size m ≥ m(ε, δ) if for all ε > 0, 0 < δ < 1, for any fixed target g ∈ F, with probability 1 − δ, based on a randomly drawn sample z^m, the hypothesis ĥ = φ(z^m) has an error L(ĥ) ≤ L(h*) + ε, where h* is an optimal hypothesis, i.e., L(h*) = inf_{h∈H} L(h). Formally, this is stated as: P^m({z^m ∈ X^m : L(ĥ) > L(h*) + ε}) ≤ δ. The smallest sample size m(ε, δ) such that there exists a learning algorithm φ for F with respect to all probability distributions is called the sample complexity of φ, or simply the sample complexity for learning F by H. If such a φ exists then F is said to be uniformly learnable by H.

We note that in the case of real-valued function classes the sample complexity depends on the error function through the particular loss function used. Algorithms φ which output a hypothesis ĥ that minimizes L_m(h) over all h ∈ H are called empirical risk minimization (ERM) algorithms (cf. Vapnik [38]). The theory of uniform learnability for ERM algorithms forms the basis for the majority of the works in the field of computational learning theory, primarily because the sample complexity is directly related to a capacity quantity called the Vapnik–Chervonenkis dimension of H in the case of an indicator function class, or to the pseudo-dimension in the case of a real-valued function class. These two quantities are defined and discussed below. Essentially the theory says that if the capacity of H is finite then F is uniformly learnable. We note that there are some pedagogic instances of functional classes, even of infinite pseudo-dimension, for which any target function can be exactly learnt from a single example of the form (x, g(x)) (cf. Bartlett et al., p. 299). For such target classes the sample complexity of learning by ERM is significantly greater than one, so ERM is not an efficient form of learning. Henceforth all the results are limited to ERM learning algorithms.
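A minimal Python sketch of the ERM rule in Definition 1 follows (this is an illustration, not code from the paper): the finite threshold hypothesis class, the target g, the absolute loss, and the uniform sampling distribution are all illustrative assumptions.

```python
import random

def empirical_error(h, sample, loss):
    """L_m(h): average loss of hypothesis h over the sample {(x_i, g(x_i))}."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)

def erm(hypotheses, sample, loss):
    """Empirical risk minimization: return the hypothesis with smallest L_m(h)."""
    return min(hypotheses, key=lambda h: empirical_error(h, sample, loss))

# Illustrative setup (not from the paper): target g, a finite class of
# threshold indicator hypotheses, absolute loss, and P = uniform on [0, 1].
g = lambda x: 1.0 if x > 0.37 else 0.0
hypotheses = [lambda x, t=t: 1.0 if x > t else 0.0 for t in [i / 20 for i in range(21)]]
absolute_loss = lambda yhat, y: abs(yhat - y)

m = 200
sample = [(x, g(x)) for x in (random.random() for _ in range(m))]
h_hat = erm(hypotheses, sample, absolute_loss)
print("empirical error of ERM hypothesis:", empirical_error(h_hat, sample, absolute_loss))
```

Since the hypotheses here are indicator-valued, the absolute loss coincides with the 0–1 disagreement of (2), so the same routine covers the indicator and the real-valued settings.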

We start with the following definition.

DEFINITION 2 (Vapnik–Chervonenkis dimension). Given a class H of indicator functions of sets in X, the Vapnik–Chervonenkis dimension of H, denoted as VC(H), is defined as the largest integer m such that there exists a sample x^m = {x_1, ..., x_m} of points in X such that the cardinality of the set of boolean vectors S_{x^m}(H) = {[h(x_1), ..., h(x_m)] : h ∈ H} satisfies |S_{x^m}(H)| = 2^m. If m is arbitrarily large then the VC-dimension of H is infinite.

Remark. The quantity max_{x^m} |S_{x^m}(H)|, where the maximum is taken over all possible m-samples, is called the growth function of H.

EXAMPLE. Let H be the class of indicator functions of interval sets on X = R. With a single point x_1 ∈ X we have |{[h(x_1)] : h ∈ H}| = 2. For two points x_1, x_2 ∈ X we have |{[h(x_1), h(x_2)] : h ∈ H}| = 4. When m = 3, for any points x_1, x_2, x_3 ∈ X we have |{[h(x_1), h(x_2), h(x_3)] : h ∈ H}| < 2^3; thus VC(H) = 2.

The main interest in the VC-dimension quantity is due to the following result on a uniform strong law of large numbers, which is a variant of Theorem 6.7 in Vapnik [38].

LEMMA 1 (Uniform SLN for the indicator function class). Let g be any fixed target indicator function and let H be a class of indicator functions of sets in X with VC(H) = d < ∞. Let z^m = {(x_i, g(x_i))}_{i=1}^m be a sample of size m > d consisting of randomly drawn examples according to any fixed probability distribution P on X. Let L_m(h) denote the empirical error for h based on z^m and g as defined in (2). Then for an arbitrary confidence parameter 0 < δ < 1, the deviation between the empirical error and the true error uniformly over H is bounded as

sup_{h∈H} |L(h) − L_m(h)| ≤ 4 √( (d(ln(2m/d) + 1) + ln(9/δ)) / m )

with probability 1 − δ.

Remark. The result actually holds more generally for a boolean random variable y ∈ Y = {0, 1} replacing the deterministic target function g(x). In such a case the sample consists of random pairs {(x_i, y_i)}_{i=1}^m distributed according to any fixed joint probability distribution P over X × Y.

Thus a function class of finite VC-dimension possesses a certain statistical smoothness property which permits simultaneous error estimation over all hypotheses in H using the empirical error estimate. We note in passing that there is an interesting generalization (cf. Buescher and Kumar [7], Devroye et al. [12]) of the empirical error estimate to other smooth estimators based on the idea of empirical coverings, which removes the condition of needing a finite VC-dimension.
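The interval example above can be checked directly by enumeration. The following small Python sketch (an illustration, not code from the paper; the midpoint endpoint grid is an implementation detail) counts the distinct label vectors that interval indicators induce on a point set, confirming that 2 points are shattered while 3 are not.

```python
def interval_labelings(points):
    """Distinct 0/1 label vectors induced on `points` by indicators of intervals [a, b]."""
    pts = sorted(points)
    # Midpoints between consecutive points plus outer values suffice as endpoints.
    grid = [pts[0] - 1.0] + [(pts[i] + pts[i + 1]) / 2 for i in range(len(pts) - 1)] + [pts[-1] + 1.0]
    labelings = set()
    for a in grid:
        for b in grid:
            if a <= b:
                labelings.add(tuple(1 if a <= x <= b else 0 for x in points))
    return labelings

for m in (1, 2, 3):
    points = list(range(m))            # any distinct real points give the same counts
    count = len(interval_labelings(points))
    print(f"m={m}: {count} labelings out of {2**m}")   # prints 2, 4, 7 -> VC dimension 2
```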

As a direct consequence of Lemma 1 we obtain the necessary and sufficient conditions for a target class of indicator functions to be uniformly learnable by a hypothesis class. This is stated next and is a slight variation of Theorem 2.1 in Blumer et al. [6].

LEMMA 2 (Uniform learnability of indicator function class). Let F and H be a target class and a hypothesis class, respectively, of indicator functions of sets in X. Then F is uniformly learnable by H if and only if VC(H) < ∞. Moreover, if VC(H) = d, where d < ∞, then for any 0 < ε, δ < 1, the sample complexity of an algorithm φ is bounded from above by c((d/ε) log(1/δ)), for some absolute constant c > 0.

We proceed now to the case of real-valued functions. The next definition, which generalizes the VC-dimension, is taken from Haussler [16] and is based on the work of Pollard [27]. Let sgn(y) be defined as 1 for y > 0 and −1 for y ≤ 0. For a Euclidean vector v ∈ R^m denote by sgn(v) = [sgn(v_1), ..., sgn(v_m)].

DEFINITION 3 (Pseudo-dimension). Given a class H of real-valued functions defined on X, the pseudo-dimension of H, denoted as dim_p(H), is defined as the largest integer m such that there exist {x_1, ..., x_m} ⊆ X and a vector v ∈ R^m such that the cardinality of the set of boolean vectors satisfies |{sgn[h(x_1) + v_1, ..., h(x_m) + v_m] : h ∈ H}| = 2^m. If m is arbitrarily large then dim_p(H) = ∞.

The next lemma appears as Theorem 4 in Haussler [16] and states that for the case of finite-dimensional vector spaces of functions the pseudo-dimension equals the dimension.

LEMMA 3. Let H be a d-dimensional vector space of functions from a set X into R. Then dim_p(H) = d.

For several useful invariance properties of the pseudo-dimension cf. Pollard [27] and Haussler [16, Theorem 5]. The main interest in the pseudo-dimension arises from having the SLN hold uniformly over a real-valued function class if it has a finite pseudo-dimension. In order to apply this to the PAC framework we need a uniform SLN result not for the hypothesis class H itself but for the induced loss class {l(h(x), y) : h ∈ H, x ∈ X, y ∈ R} for some fixed loss function l, since an ERM-based algorithm minimizes the empirical error, i.e., L_m(h), over H. While the theory presented in this paper applies to general loss functions, we restrict here to the absolute loss l(h(x), g(x)) = |h(x) − g(x)|. The next lemma is a variant of Theorem 7.3 of Vapnik [38].

THEOREM 1. Let P be any probability distribution on X and let g be a fixed target function. Let H be a class of functions from X to R which has pseudo-dimension d ≥ 1, and for any h ∈ H denote by L(h) = E|h(x) − g(x)| and assume L(h) ≤ M for some absolute constant M > 0. Let {(x_i, g(x_i))}_{i=1}^m, x_i ∈ X, be an i.i.d. sample of size m > 16(d + 1) log_2(4(d + 1)) drawn according to P. Then for arbitrary 0 < δ < 1, simultaneously for every function h ∈ H, the inequality

|L(h) − L_m(h)| ≤ 4M √( (16(d + 1) log_2(4(d + 1))(ln(2m) + 1) + ln(9/δ)) / m )   (3)

holds with probability 1 − δ. The theorem is proved in Section A.1.

Remark. For uniform SLN results based on other loss functions see Theorem 8 of Haussler [16].

We may take twice the right-hand side of (3) to be bounded from above by the simpler expression

ε(m, d, δ) ≡ c_1 √( (d log_2 d · ln m + ln(1/δ)) / m )   (4)

for some absolute constant c_1 > 0. Since an ERM algorithm picks a hypothesis ĥ whose empirical error satisfies L_m(ĥ) = inf_{h∈H} L_m(h), and by Definition 1, L(h*) = inf_{h∈H} L(h), it follows that

L(ĥ) ≤ L_m(ĥ) + ε(m, d, δ)/2 ≤ L_m(h*) + ε(m, d, δ)/2 ≤ L(h*) + ε(m, d, δ).   (5)

By (5) and according to Definition 1 it is immediate that ERM may be considered as a PAC learning algorithm for F. Thus we have the following lemma concerning the sufficient condition for uniform learnability of a real-valued function class.

LEMMA 4 (Uniform learnability of real-valued function class). Let F and H be the target and hypothesis classes of real-valued functions, respectively, and let P be any fixed probability distribution on X. Let the loss function be l(g(x), h(x)) = |g(x) − h(x)| and assume L(h) ≤ M for all h ∈ H and g ∈ F, for some absolute constant M > 0. If dim_p(H) < ∞ then F is uniformly learnable by H. Moreover, if dim_p(H) = d < ∞ then for any ε > 0, 0 < δ < 1, the sample complexity of learning F by H is bounded from above by (cM^2 ln^2(d)/ε^2)(ln(M/ε) + ln(1/δ)), for some absolute constant c > 0.

Remarks. As in the last remark above, this result can be extended to other loss functions l. In addition, Alon et al. [4] recently showed that a quantity called the scale-sensitive dimension, which is a generalization of the pseudo-dimension, determines the necessary and sufficient condition for uniform learnability.
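As a quick numerical illustration of (4) and (5), the sketch below (illustrative only; the constant c_1 and the chosen values of m, d, δ are arbitrary assumptions, since the text leaves c_1 unspecified) evaluates the deviation term ε(m, d, δ) and shows how it shrinks with the sample size.

```python
import math

def epsilon(m, d, delta, c1=1.0):
    """Deviation term of (4): c1 * sqrt((d*log2(d)*ln(m) + ln(1/delta)) / m).
    The constant c1 is unspecified in the text; c1 = 1.0 is an arbitrary choice."""
    return c1 * math.sqrt((d * math.log2(d) * math.log(m) + math.log(1.0 / delta)) / m)

d, delta = 8, 0.05
for m in (100, 1_000, 10_000, 100_000):
    eps = epsilon(m, d, delta)
    # By (5), the ERM hypothesis then satisfies L(h_hat) <= L(h*) + eps.
    print(f"m={m:>6}: eps(m, d, delta) = {eps:.3f}")
```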

It is also worth noting that there have been several works related to the pseudo-dimension which are used for mathematical analysis other than learning theory. As far as we are aware, Warren [39] was the earliest who considered a quantity called the number of connected components of a nonlinear manifold of real-valued functions, which closely resembles the growth function of Vapnik and Chervonenkis for set-indicator functions; see Definition 2. Using this he determined lower bounds on the degree of approximation by certain nonlinear manifolds. Maiorov [20] calculated this quantity and determined the degree of approximation for the nonlinear manifold of ridge functions, which includes the manifold of functions represented by artificial neural networks with one hidden layer. Maiorov, Meir, and Ratsaby [21] extended his result to the degree of approximation measured by a probabilistic (n, δ)-width with respect to a uniform measure over the target class and determined finite sample complexity bounds for model selection using neural networks [29]. For more works concerning probabilistic widths of classes see Traub et al. [34], Maiorov and Wasilkowski [22].

Throughout the remainder of the paper we will deal with learning real-valued functions while denoting explicitly a hypothesis class as H_d, one which has dim_p(H_d) = d. For any probability distribution P and target function g, the error and empirical error of a hypothesis h are defined by the L_1(P)-metric as

L(h) = E|h(x) − g(x)|,   L_m(h) = (1/m) Σ_{i=1}^m |h(x_i) − g(x_i)|,   (6)

respectively. We discuss next some practical motivation for our work.

3. MOTIVATION FOR A THEORY OF LEARNING WITH PARTIAL INFORMATION

It was mentioned in Section 1 that the notion of having partial knowledge about a solution to a problem, or more specifically about a target function, is often encountered in practice. Starting from the most elementary instances of learning in humans, it is almost always the case that a learner begins with some partial information about the problem. For instance, in learning cancer diagnosis, a teacher not only provides examples of pictures of healthy cells and benign cells but also descriptive partial information such as "a benign cell has color black and elongated shape," or "benign cells usually appear in clusters." Similarly, for machine learning it is intuitive that partial information must be useful.

While much of the classical theory of pattern recognition (Duda and Hart [13], Fukunaga [14]) and the more recent theory of computational learning (Kearns and Vazirani [18]) and neural networks (Ripley [32]) focus on learning from randomly drawn data, there has been an emergence of interest in nonclassical forms of learning, some of which indicates that partial information, in various forms which depend on the specific application, is useful in practice. This is related to the substream known as active learning, where the learner participates actively by various forms of querying to obtain information from the teacher. For instance, the notion of selective sampling (cf. Cohn et al. [8]) permits the learner to query for samples from domain regions having high classification uncertainty. Cohn [9] uses methods based on the theory of optimal experiment design to select data in an on-line fashion with the aim of decreasing the variance of an estimate. Abu-Mostafa [1–3] refers to partial information as hints and considers them for financial prediction problems. He shows that certain types of hints which reflect invariance properties of the target function g, for instance saying that g(x) = g(x′) at some points x, x′ in the domain, may be incorporated into a learning error criterion.

In this paper we adopt the framework of information-based complexity (cf. Traub et al. [34]) to represent partial information. In this framework, whose basic definitions are reviewed in Section 5, we limit to linear information comprised of n linear functionals L_i(g), 1 ≤ i ≤ n, operating on the target function g. In order to motivate the interest in partial information as being given by such n-dimensional linear operators we give the following example of learning pattern classification using a classical nonparametric discriminant analysis method (cf. Fukunaga [14]).

The field of pattern recognition treats a wide range of practical problems where an accurate decision is to be made concerning a stochastic pattern which is in the form of a multidimensional vector of features of an underlying stochastic information source; for instance, deciding which of a finite number of types of stars corresponds to given image data taken by an exploratory spacecraft, or deciding which of the words in a finite dictionary corresponds to given speech data which consist of spectral analysis information on a sound signal. Such problems have been classically modeled according to a statistical framework where the input data are stochastic and are represented as random variables with a probability distribution over the data space. The most widely used criterion for learning pattern recognition (or classification) is the misclassification probability on randomly chosen data which have not been seen during the training stage of learning. In order to ensure an accurate decision it is necessary to minimize this criterion. The optimal decision rule is one which achieves the minimum possible misclassification probability and has been classically referred to as Bayes' decision rule. We now consider an example of learning pattern recognition using randomly drawn examples, where partial information takes the form of feature extraction.

EXAMPLE (Learning pattern classification). The setting consists of M pattern classes represented by unknown nonparametric class conditional probability density functions f(x|j) over X = R^l with known corresponding a priori class probabilities p_j, 1 ≤ j ≤ M. It is well known that the optimal Bayes classifier, which has the minimal misclassification probability, is defined as follows: g(x) = argmax_{1≤j≤M} {p_j f(x|j)}, where argmax_{j∈A} B_j denotes any element j in A such that B_j ≥ B_i, j ≠ i. Its misclassification probability is called the Bayes error. For instance, suppose that M = 2 and f(x|j), j = 1, 2, are both l-dimensional Gaussian probability density functions. Here the two pattern classes clearly overlap, as their corresponding functions f(x|1) and f(x|2) have an overlapping probability-1 support; thus the optimal Bayes misclassification probability must be greater than zero. The Bayes classifier in this case is an indicator function over a set A = {x ∈ R^l : q(x) > 0}, where q(x) is a second degree polynomial over R^l. We henceforth let the target function, denoted by g(x), be the Bayes classifier and note that it may not be unique. The target class F is defined as a rich class of classifiers, each of which maps X to {1, ..., M}. The training sample consists of m i.i.d. pairs {(x_i, y_i)}_{i=1}^m, where y_i ∈ {1, 2, ..., M} takes the value j with probability p_j, and x_i is drawn according to the probability distribution corresponding to f(x|y_i), 1 ≤ i ≤ m. The learner has a hypothesis class H of classifier functions mapping X to {1, ..., M} which has a finite pseudo-dimension. Formally, the learning problem is to approximate g by a hypothesis h in H. The error of h is defined as L(h) = ‖h − g‖_{L_1(P)}, where P is some fixed probability distribution over X. Stated in the PAC framework, a target class F is to be uniformly learned by H; i.e., for any fixed target g ∈ F and any probability distribution P on X, find an ĥ ∈ H which depends on g and whose error satisfies L(ĥ) ≤ L(h*) + ε with probability 1 − δ, where L(h*) = inf_{h∈H} ‖g − h‖_{L_1(P)}.

As partial information, consider the ubiquitous method of feature extraction, which is described next. In the pattern classification paradigm it is often the case that, based on a given sample {(x_i, y_i)}_{i=1}^m which consists of feature vectors x_i ∈ R^l, 1 ≤ i ≤ m, one obtains a hypothesis classifier ĥ which incurs a large misclassification probability. A natural remedy in such situations is to try to improve the set of features by generating a new feature vector y ∈ Y = R^k, k ≤ l, which depends on x, with the aim of finding a better representation for a pattern which leads to larger separation between the different pattern classes. This in turn leads to a simpler classifier g which can now be better approximated by a hypothesis h in the same class H of pseudo-dimension d, the latter having not been rich enough before for approximating the original target g. Consequently, with the same sample complexity one obtains via ERM a hypothesis ĥ which estimates g better and therefore has a misclassification probability closer to the optimal Bayes misclassification probability.

Restricting to linear mappings A: X → Y, classical discriminant analysis methods (cf. Fukunaga [14, Section 9.2]; Duda and Hart [13, Chap. 4]) calculate the optimal new feature vector y by determining the best linear map A* which, according to one of the widely used criteria, maximizes the pattern class separability. Such criteria are defined by the known class probabilities p_j, the class conditional means μ_j = E(X|j), and the class conditional covariance matrices C_j = E((X − μ_j)(X − μ_j)^T | j), 1 ≤ j ≤ M, where the expectation E(·|j) is taken with respect to the jth class conditional probability distribution corresponding to f(x|j). In reality the empirical average over the sample is used instead of taking the expectation, since the underlying probability distributions corresponding to f(x|j), 1 ≤ j ≤ M, are unknown. Theoretically, the quantities μ_j, C_j may be viewed as partial indirect information about the target Bayes classifier g. Such information can be represented by an n-dimensional vector of linear functionals acting on f(x|j), 1 ≤ j ≤ M, i.e.,

N([f(·|1), ..., f(·|M)]) = [{μ_{j,s}}_{j=1,...,M; s=1,...,l}, {σ^j_{s,r}}_{j=1,...,M; 1≤s≤r≤l}],

where μ_{j,s} = ∫_X x_s f(x|j) dx and σ^j_{s,r} = ∫_X x_s x_r f(x|j) dx, and where x_r, x_s, 1 ≤ r, s ≤ l, are components of x. The dimensionality of the information vector is n = (Ml/2)(l + 3).
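To make the feature-extraction example concrete, here is a small Python sketch (illustrative, not from the paper; the Gaussian data and all numeric values are arbitrary assumptions) that computes the empirical counterparts of the information vector above, i.e., the per-class sample means μ̂_{j,s} and second moments σ̂^j_{s,r} for s ≤ r, from a labeled sample; its length matches n = (Ml/2)(l + 3).

```python
import numpy as np

def empirical_information_vector(X, y, num_classes):
    """Empirical version of N([f(.|1), ..., f(.|M)]): per-class means and
    second moments x_s * x_r (s <= r), stacked into one vector of length
    M*l*(l+3)/2."""
    l = X.shape[1]
    pieces = []
    for j in range(num_classes):
        Xj = X[y == j]
        mean_j = Xj.mean(axis=0)                                    # mu_{j,s}, s = 1..l
        second_j = (Xj[:, :, None] * Xj[:, None, :]).mean(axis=0)   # empirical E[x_s x_r | j]
        upper = second_j[np.triu_indices(l)]                        # keep s <= r only
        pieces.append(np.concatenate([mean_j, upper]))
    return np.concatenate(pieces)

# Illustrative data: M = 2 Gaussian classes in R^3 (all numbers arbitrary).
rng = np.random.default_rng(0)
M, l, m = 2, 3, 500
y = rng.integers(0, M, size=m)
X = rng.normal(loc=y[:, None] * 1.5, scale=1.0, size=(m, l))

info = empirical_information_vector(X, y, M)
print(len(info), M * l * (l + 3) // 2)   # both equal 18 for M = 2, l = 3
```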

We have so far presented the theory for learning from examples and introduced the importance of partial information from a practical perspective. Before we proceed with a theoretical treatment of learning with partial information, we digress momentarily to introduce a new quantity which is defined in the context of the mathematical field of approximation theory and which plays an important part in our learning framework.

4. A NEW NONLINEAR APPROXIMATION WIDTH

The large mathematical field of approximation theory is primarily involved in problems of existence, uniqueness, and characterization of the best approximation to elements of a normed linear space B by various types of finite-dimensional subspaces B_n of B (cf. Pinkus [25]). Approximation of an element f ∈ B is measured by the distance of the finite-dimensional subspace B_n to f, where distance is usually defined as inf_{g∈B_n} ‖f − g‖, and where throughout this discussion ‖·‖ is any well-defined norm over B. The degree of approximation of a subset (possibly a nonlinear manifold) F ⊆ B by B_n is defined by the distance between F and B_n, which is usually taken as sup_{f∈F} inf_{g∈B_n} ‖f − g‖. The Kolmogorov n-width is the classical distance definition when one allows the approximating set B_n to vary over all possible linear subspaces of B. It is defined as K_n(F; B) = inf_{B_n} sup_{f∈F} inf_{g∈B_n} ‖f − g‖. This definition leads to the notion of the best approximating subspace B_n, i.e., the one whose distance from F equals K_n(F; B).

While linear approximation, e.g., using finite-dimensional subspaces of polynomials, is important and useful, there are many known spaces which can be approximated better by nonlinear subspaces, for instance, by the span of a neural-network basis H = {h(x) = Σ_{i=1}^n c_i σ(w_i^T x − b_i) : w_i ∈ R^l, c_i, b_i ∈ R, 1 ≤ i ≤ n}, where σ(y) = 1/(1 + e^{−y}). In this brief overview we will follow the notation and definitions of DeVore [10]. Let M_n be a mapping from R^n into the Banach space B which associates with each a ∈ R^n the element M_n(a) ∈ B. Functions f ∈ B are approximated by functions in the manifold H_n = {M_n(a) : a ∈ R^n}. The measure of approximation of f by H_n is naturally defined as the distance inf_{a∈R^n} ‖f − M_n(a)‖. As above, the degree of approximation of a subset F of B by H_n is defined as sup_{f∈F} inf_{a∈R^n} ‖f − M_n(a)‖. In analogy to the Kolmogorov n-width, it would be tempting to define the optimal approximation error of F by manifolds of finite dimension n as inf_{M_n} sup_{f∈F} inf_{a∈R^n} ‖f − M_n(a)‖. However, as pointed out in [10], this width is zero for all subsets F in every separable class B. To see this, consider the following example, which describes a space-filling manifold: let {f_k}_{k=−∞}^{∞} be dense in B and define M_1(a) = (a − k) f_{k+1} + (k + 1 − a) f_k for k ≤ a ≤ k + 1. The mapping M_1 : R → B is continuous, with a corresponding one-dimensional manifold H_1 satisfying sup_{f∈F} inf_{a∈R} ‖f − M_1(a)‖ = 0. Thus this measure of width of F is not natural. One possible alternative used in approximation theory is to impose a smoothness constraint on the nonlinear manifolds H_n that are allowed in the outermost infimum. However, this excludes some interesting manifolds, such as splines with free knots. A more useful constraint is to limit the selection operator r, which takes an element f ∈ F to R^n, to be continuous. Given such an operator r, the approximation of f by a manifold H_n is M_n(r(f)). The distance between the set F and the manifold H_n is then defined as sup_{f∈F} ‖f − M_n(r(f))‖. The continuous nonlinear n-width of F is then defined as D_n(F; B) = inf_{r: cont., M_n} sup_{f∈F} ‖f − M_n(r(f))‖, where the infimum is taken over all continuous selection operators r and all manifolds H_n. This width was considered by Alexandrov [33] and DeVore [10] and is determined for various F and B in [10].

The Alexandrov nonlinear width does not in general reflect the degree of approximation of the more natural selection operator r which chooses the best approximation for an f ∈ F as its closest element in H_n, i.e., that whose distance from f equals inf_{g∈H_n} ‖f − g‖, the reason being that such an r is not necessarily continuous. In this paper we consider an interesting alternate definition for a nonlinear width of a function class which does not have this deficiency. Based on the pseudo-dimension (Definition 3 in Section 2) we define the nonlinear width

ρ_d(F) ≡ inf_{H_d} sup_{f∈F} inf_{h∈H_d} ‖f − h‖,   (7)

where H_d runs over all classes (not necessarily in F) having pseudo-dimension d. Now the natural selection operator is used, namely, the one which approximates f by an element h(f) ∈ H_d, where ‖f − h(f)‖ = inf_{h∈H_d} ‖f − h‖. The constraint of using finite pseudo-dimensional approximation manifolds allows dropping the smoothness constraint on the manifold and the continuity constraint on the selection operator. The width ρ_d expresses the ability of manifolds to approximate according to their pseudo-dimension, as opposed to their dimensionality as in some of the classical widths. The reason that ρ_d is interesting from a learning theoretic aspect is that the constraint on the approximation manifold involves the pseudo-dimension dim_p(H_d), which was shown in Section 2 to have a direct effect on uniform learnability, namely, a finite pseudo-dimension guarantees consistent estimation. Thus ρ_d involves two independent mathematical notions, namely, the approximation ability and the statistical estimation ability of H_d. As will be shown in the next sections, joining both notions in one quantity enables us to quantify the trade-off between information and sample complexity as applied to the learning paradigm. We halt the discussion about ρ_d and refer the interested reader to [23], where we estimate it for a standard Sobolev class W_p^{r,l}, 1 ≤ p, q ≤ ∞.
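For reference, the three widths discussed in this section can be written side by side (a restatement in LaTeX for readability, using the notation reconstructed above):

```latex
\begin{align*}
\text{Kolmogorov width:} \quad
  & K_n(F;B) = \inf_{B_n}\; \sup_{f \in F}\; \inf_{g \in B_n} \|f - g\|,
  && B_n \text{ over linear subspaces of dimension at most } n,\\
\text{Alexandrov width:} \quad
  & D_n(F;B) = \inf_{r \text{ cont.},\, M_n}\; \sup_{f \in F} \|f - M_n(r(f))\|,
  && M_n:\mathbb{R}^n \to B,\ r:F \to \mathbb{R}^n \text{ continuous},\\
\text{Nonlinear width (7):} \quad
  & \rho_d(F) = \inf_{\mathcal{H}_d}\; \sup_{f \in F}\; \inf_{h \in \mathcal{H}_d} \|f - h\|,
  && \mathcal{H}_d \text{ over classes with } \dim_p(\mathcal{H}_d) = d.
\end{align*}
```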

5. THE MINIMAL PARTIAL INFORMATION ERROR

In this section we review some basic concepts in the field of information-based complexity and then extend these to define a new quantity called the minimal partial information error, which is later used in the learning framework. Throughout this section, ‖·‖ denotes any function norm, and the distance between two function classes F_1 and F_2 is denoted as dist(F_1, F_2, L_q) = sup_{f∈F_1} inf_{g∈F_2} ‖f − g‖_{L_q}, q ≥ 1. The following formulation of partial information is taken from Traub et al. [34]. While we limit here to the case of approximating functions f, we note that the theory is suitable for problems of approximating general functionals S(f). Let N_n : F → N_n(F) ⊆ R^n denote a general information operator. The information N_n(g) consists of n measurements taken on the target function g or, in general, any function f ∈ F; i.e., N_n(f) = [L_1(f), ..., L_n(f)], where L_i, 1 ≤ i ≤ n, denote any functionals. We call n the cardinality of information, and we sometimes omit n and write N(f). The variable y denotes an element in N_n(F). The subset N_n^{-1}(y) denotes all functions f ∈ F which share the same information vector y, i.e., N_n^{-1}(y) = {f ∈ F : N_n(f) = y}. We denote by N_n^{-1}(N_n(g)) the solution set, which may also be written as {f ∈ F : N_n(f) = N_n(g)} and which consists of all indistinguishable functions f having the same information vector as the target g. Given y ∈ R^n, it is assumed that a single element, denoted as g_y ∈ N_n^{-1}(y), can be constructed. In this model, information effectively partitions the target class F into infinitely many subsets N_n^{-1}(y), y ∈ R^n, each having a single representative g_y which forms the approximation for any f ∈ N^{-1}(y). Denote the radius of N^{-1}(y) by

r(N, y) = inf_{f̃} sup_{f ∈ N^{-1}(y)} ‖f̃ − f‖   (8)

and call it the local radius of information N at y. The global radius of information N is defined as the local radius for a worst y, i.e.,

r(N) = sup_{y ∈ N(F)} r(N, y).

This quantity measures the intrinsic uncertainty or error which is associated with a fixed information operator N. Note that in both of these definitions the dependence on F is implicit. Let Λ be a family of functionals and consider the family Λ_n which consists of all information N = [L_1, ..., L_k] of cardinality k ≤ n with L_i ∈ Λ, 1 ≤ i ≤ n. Then

r(n, Λ) = inf_{N ∈ Λ_n} r(N)

is called the nth minimal radius of information in the family Λ, and N_n* = [L_1*, ..., L_n*] is called the nth optimal information in the class Λ iff L_i* ∈ Λ and r(N_n*) = r(n, Λ). When Λ is the family of all linear functionals, then r(n, Λ) becomes a slight generalization of the well-known Gelfand width of the class F, whose classical definition is d^n(F) = inf_{A^n} sup_{f ∈ F ∩ A^n} ‖f‖, where A^n is any linear subspace of codimension n. In this paper we restrict to the family Λ of linear functionals, and for notational simplicity we will henceforth take the information space N_n(F) = R^n.

As already mentioned in the definition of r(N, y), there is a single element g_y, not necessarily in N^{-1}(y), which is selected as an approximator for all functions f ∈ N^{-1}(y). Such a definition is useful for the problem of information-based complexity since all that one is concerned with is to produce an ε-approximation based on information alone. In the PAC framework, however, a major significance is placed on providing an approximator to a target g which is an element not necessarily of the target class F but of some hypothesis class H_d of finite pseudo-dimension d by which F is uniformly learnable. We therefore replace the single representative of the subset N^{-1}(y) by a whole approximation class of functions H_d^y of pseudo-dimension d.
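The local and global radii of (8) can be computed exactly in a toy setting. The sketch below is purely illustrative and not from the paper: it assumes a finite class of functions sampled on a grid, coordinate-evaluation functionals as the information operator, and the sup norm, under which the optimal center of a solution set is its coordinate-wise midpoint.

```python
import numpy as np

def local_radius(cls, info_idx, y):
    """r(N, y) of (8) for a finite class of grid-sampled functions under the
    sup norm: the Chebyshev radius of the solution set N^{-1}(y)."""
    members = [f for f in cls if np.array_equal(f[info_idx], y)]
    stack = np.stack(members)
    # For the sup norm the best center is the coordinate-wise midpoint, so the
    # radius is half the largest coordinate-wise spread over the solution set.
    return 0.5 * (stack.max(axis=0) - stack.min(axis=0)).max()

def global_radius(cls, info_idx):
    """r(N) = sup_y r(N, y) over all observed information vectors."""
    ys = {tuple(f[info_idx]) for f in cls}
    return max(local_radius(cls, info_idx, np.array(y)) for y in ys)

# Toy class: 20 functions on an 8-point grid sharing one of 3 prefixes, so that
# several functions are indistinguishable under the n = 3 measured values.
rng = np.random.default_rng(1)
prefixes = [rng.uniform(-1, 1, size=3).round(1) for _ in range(3)]
cls = [np.concatenate([prefixes[k % 3], rng.uniform(-1, 1, size=5)]) for k in range(20)]
info_idx = np.arange(3)
print("global radius r(N):", global_radius(cls, info_idx))
```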

Note that now information alone does not point to a single ε-approximation element, but rather to a manifold H_d^y, possibly nonlinear, which for any f ∈ N^{-1}(y), in particular the target g, contains an element h*, dependent on g, such that the distance ‖g − h*‖ ≤ ε. Having a pseudo-dimension d implies that with a finite random sample {(x_i, g(x_i))}_{i=1}^m, an ERM learning algorithm (after being shown partial information and hence pointed to the class H_d^y) can determine a function ĥ ∈ H_d^y whose distance from g is no farther than ε from the distance between h* and g, with confidence 1 − δ. Thus, based on n units of information about g and m labeled examples {(x_i, g(x_i))}_{i=1}^m, an element ĥ can be found such that ‖g − ĥ‖ ≤ 2ε with probability 1 − δ. The sample complexity m does not depend on the type of hypothesis class but only on its pseudo-dimension. Thus the above construction is true for any hypothesis class (or manifold) of pseudo-dimension d. Hence we may permit any hypothesis class H_d of pseudo-dimension d to play the role of the approximation manifold H_d^y of the subset N^{-1}(y). This amounts to replacing the infimum in the definition (8) of r(N, y) by an infimum over H_d and replacing ‖f̃ − f‖ by dist(f, H_d) = inf_{h∈H_d} ‖f − h‖, yielding the quantity ρ_d(N^{-1}(y)) as a new definition for a local radius, and a new quantity I_{n,d}(F) (to be defined later) which replaces r(n, Λ). We next formalize these ideas through a sequence of definitions. We use ρ_d(K, L_q) to explicitly denote the norm L_q used in the definition (7). We now define three optimal quantities, N_n*, H_d^{y*}, and h*, all of which implicitly depend on the unknown distribution P, while h* depends also on the unknown target g.

DEFINITION 4. Let the optimal linear information operator N_n* of cardinality n be one which minimizes the approximation error of the solution set N_n^{-1}(y) (in the worst case over y ∈ R^n) over all linear operators N_n of cardinality n and manifolds of pseudo-dimension d. Formally, it is defined as one which satisfies

sup_{y∈R^n} ρ_d((N_n*)^{-1}(y), L_1(P)) = inf_{N_n} sup_{y∈R^n} ρ_d(N_n^{-1}(y), L_1(P)).

DEFINITION 5. For a fixed optimal linear information operator N_n* of cardinality n, define the optimal hypothesis class H_d^{y*} of pseudo-dimension d (which depends implicitly on N_n* through y) as one which minimizes the approximation error of the solution set (N_n*)^{-1}(y) over all manifolds of pseudo-dimension d. Formally, it is defined as one which satisfies

dist((N_n*)^{-1}(y), H_d^{y*}, L_1(P)) = ρ_d((N_n*)^{-1}(y), L_1(P)).

DEFINITION 6. For a fixed target g, optimal linear information operator N_n*, and the optimal hypothesis class H_d^{y*} with y = N_n*(g), define the optimal hypothesis h* to be any function which minimizes the error over this class, namely,

L(h*) = inf_{h ∈ H_d^{y*}, y = N_n*(g)} L(h).   (9)

As mentioned earlier, the main motive of the paper is to compute the value of partial information for learning in the PAC sense. We will assume that the teacher has access to unlimited (linear) information, which is represented by him knowing the optimal linear information operator N_n* and the optimal hypothesis class H_d^{y*} for every y ∈ R^n. Thus in this ideal setting, providing partial information amounts to pointing to the optimal hypothesis class H_d^{y*}, y = N_n*(g), which contains an optimal hypothesis h*. We again note that information alone does not point to h*; it is the role of learning from examples to complete the process by estimating h* using a hypothesis ĥ. The error of h* is important in its own right. It represents the minimal error for learning a particular target g given optimal information of cardinality n. In line with the notion of uniform learnability (see Section 2) we define a variant of this optimal quantity which is independent of the target g and the probability distribution P; i.e., instead of a specific target g, we consider the worst target in F, and we use the L_∞ norm for approximation. This yields the following definition.

DEFINITION 7 (Minimal partial information error). For any target class F and any integers n, d ≥ 1, let

I_{n,d}(F) ≡ inf_{N_n} sup_{y∈R^n} ρ_d(N_n^{-1}(y), L_∞),

where N_n runs over all linear information operators. I_{n,d}(F) represents the minimal error for learning the worst-case target in F in the PAC sense (i.e., assuming an unknown underlying probability distribution) while given optimal information of cardinality n and using an optimal hypothesis class of pseudo-dimension d. We proceed next to unify the theory of Section 2 with the concepts introduced in the current section.

6. LEARNING FROM EXAMPLES WITH OPTIMAL PARTIAL INFORMATION

In Section 2 we reviewed the notion of uniform learnability of a target class F by a hypothesis class H_d of pseudo-dimension d < ∞. By minimizing an empirical error based on the random sample, a learner obtains a hypothesis ĥ which provides a close approximation of the optimal hypothesis h*, to within ε accuracy, with confidence 1 − δ. Suppose that prior to learning the learner obtains optimal information N_n*(g) about g. This effectively points the learner to a class H_d^{y*}, y = N_n*(g), which contains a hypothesis h* as defined in (9).
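The chain of optimal quantities just defined can be summarized in one display (a restatement of Definitions 4–7 in LaTeX, using the notation reconstructed above):

```latex
\begin{align*}
N_n^{*} &\in \arg\min_{N_n}\; \sup_{y \in \mathbb{R}^n} \rho_d\!\left(N_n^{-1}(y), L_1(P)\right)
  && \text{(optimal linear information, Def.\ 4)}\\
\mathcal{H}_d^{y*} &\in \arg\min_{\mathcal{H}_d}\; \operatorname{dist}\!\left(\left(N_n^{*}\right)^{-1}(y), \mathcal{H}_d, L_1(P)\right)
  && \text{(optimal hypothesis class, Def.\ 5)}\\
h^{*} &\in \arg\min_{h \in \mathcal{H}_d^{y*},\; y = N_n^{*}(g)} L(h)
  && \text{(optimal hypothesis, Def.\ 6)}\\
I_{n,d}(F) &\equiv \inf_{N_n}\; \sup_{y \in \mathbb{R}^n} \rho_d\!\left(N_n^{-1}(y), L_\infty\right)
  && \text{(minimal partial information error, Def.\ 7)}
\end{align*}
```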

The error of h* is bounded from above as

L(h*) = inf_{h ∈ H_d^{y*}, y = N_n*(g)} L(h)   (10)
      = inf_{h ∈ H_d^{y*}, y = N_n*(g)} ‖g − h‖_{L_1(P)}   (11)
      ≤ sup_{f : N_n*(f) = N_n*(g)} inf_{h ∈ H_d^{y*}, y = N_n*(g)} ‖f − h‖_{L_1(P)}   (12)
      = dist((N_n*)^{-1}(N_n*(g)), H_d^{y*}, L_1(P)),  y = N_n*(g).   (13)

By Definition 5 this equals ρ_d((N_n*)^{-1}(N_n*(g)), L_1(P)) and is bounded from above by

sup_{y∈R^n} ρ_d((N_n*)^{-1}(y), L_1(P)).

The latter equals

inf_{N_n} sup_{y∈R^n} ρ_d(N_n^{-1}(y), L_1(P))

by Definition 4. This is bounded from above by inf_{N_n} sup_{y∈R^n} ρ_d(N_n^{-1}(y), L_∞), which from Definition 7 is I_{n,d}(F).

Subsequently, the teacher provides m i.i.d. examples {(x_i, g(x_i))}_{i=1}^m randomly drawn according to any probability distribution P on X. Armed with prior knowledge and a random sample, the learner then minimizes the empirical error L_m(h) over all h ∈ H_d^{y*}, y = N_n*(g), yielding an estimate ĥ of h*. We may break up the error L(ĥ) into a learning error component and a minimal partial information error component,

L(ĥ) = (L(ĥ) − L(h*)) + L(h*) ≤ ε(m, d, δ) + I_{n,d}(F),   (14)

where the first term, the learning error defined in (4), measures the extra error incurred by using ĥ as opposed to the optimal hypothesis h*, and the second term is the minimal partial information error. The important difference from the PAC model can be seen in comparing the upper bound of (14) with that of (5). The former depends not only on the sample size m and the pseudo-dimension d but also on the amount n of partial information. To see how m, n, and d influence the performance, i.e., the error of ĥ, we will next particularize to a specific target class.

7. SOBOLEV TARGET CLASS

The preceding theory is now applied to the problem of learning a target in a Sobolev class F = W^{r,l}(M), for r, l ∈ Z_+, M > 0, which is defined as the set of all functions over X = [0, 1]^l having all partial derivatives up to order r bounded in the L_∞ norm by M. Formally, let k = [k_1, ..., k_l] ∈ Z_+^l, |k| = Σ_{i=1}^l k_i, and denote by D^k f = ∂^{k_1+···+k_l} f / (∂x_1^{k_1} ··· ∂x_l^{k_l}); then

W^{r,l}(M) = {f : sup_{x∈[0,1]^l} |D^k f(x)| ≤ M, |k| ≤ r},

which henceforth is referred to as W^{r,l} or F. We now state the main results and their implications.

THEOREM 2. Let F = W^{r,l}, let n ≥ 1, d ≥ 1 be given integers, and let c_2 > 0 be a constant independent of n and d. Then

I_{n,d}(F) ≤ c_2 (n + d)^{−r/l}.

The proof of the theorem is in Section A.2.

THEOREM 3. Let the target class F = W^{r,l} and let g ∈ F be the unknown target function. Given an i.i.d. random sample {(x_i, g(x_i))}_{i=1}^m of size m drawn according to any unknown distribution P on X, and given an optimal partial information vector N_n*(g) consisting of n linear operations on g, for any d ≥ 1 let H_d^{y*}, y = N_n*(g), be the optimal hypothesis class of pseudo-dimension d. Let ĥ be the output hypothesis obtained from running empirical error minimization over H_d^{y*}. Then for an arbitrary 0 < δ < 1, the error of ĥ is bounded as

L(ĥ) ≤ c_1 √( (d log_2 d · ln m + ln(1/δ)) / m ) + c_2 / (n + d)^{r/l},   (15)

where c_1, c_2 > 0 are constants independent of m, n, and d.

The proof of Theorem 3 is based on Theorem 1 and Theorem 2, both of which are proved in the Appendix. We now discuss several dependences and trade-offs between the three complexity variables m, n, and d. First, for a fixed sample size m and fixed information cardinality n there is an optimal class complexity

d* ≅ c_3 ( (rm / (l ln m))^{2l/(l+2r)} − n ),   (16)

which minimizes the upper bound on the error, where c_3 > 0 is an absolute constant. The complexity d is a free parameter in our learning setting and is proportional to the degree to which the estimator ĥ fits the data while estimating the optimal hypothesis h*.
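The optimal complexity d* of (16) can be illustrated numerically. The sketch below is purely illustrative: the constants c_1, c_2 and the values of m, n, r, l are arbitrary assumptions (the theorem does not specify them), and the bound evaluated is the reconstruction of (15) above.

```python
import math

def error_bound(m, n, d, r, l, delta=0.05, c1=1.0, c2=1.0):
    """Upper bound (15): learning error plus minimal partial information error.
    The constants c1, c2 are unspecified in the text and set to 1 here."""
    learning = c1 * math.sqrt((d * math.log2(d) * math.log(m) + math.log(1 / delta)) / m)
    partial_info = c2 / (n + d) ** (r / l)
    return learning + partial_info

# Arbitrary illustrative setting.
m, n, r, l = 100_000, 5, 1, 3
d_star = min(range(2, 1_000), key=lambda d: error_bound(m, n, d, r, l))
print("optimal complexity d* =", d_star,
      "with bound", round(error_bound(m, n, d_star, r, l), 4))
```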

The result suggests that for a given sample size m and partial information cardinality n, there is an optimal estimator (or model) complexity d* which minimizes the error rate. Thus if a structure of hypothesis classes {H_d}_{d≥1} is available in the learning problem, then based on fixed m and n the best choice of a hypothesis class over which the learner should run empirical error minimization is H_{d*}, with d* as in (16). The notion of having an optimal complexity d* is closely related to statistical model selection (cf. Linhart and Zucchini [19], Devroye et al. [12], Ratsaby et al. [29]). For instance, in Vapnik's structural risk minimization (SRM) criterion [38] the trade-off is between m and d. For a fixed m, it is possible to calculate the optimal complexity d* of a hypothesis class in a nested class structure H_1 ⊆ H_2 ⊆ ··· by minimizing an upper bound on the error, L(ĥ) ≤ L_m(ĥ) + ε(m, d, δ), over all d ≥ 1. The second term ε(m, d, δ) is commonly referred to as the penalty for data-overfitting, which one wants to balance against the empirical error. Similarly, in our result, the upper bound on the learning error reflects the cost or penalty of overfitting the data: the larger d, the higher the degree of data fit and the larger the penalty. However, here, as opposed to SRM, the bound is independent of the random sample and there is an extra parameter n that affects how m and d trade off. As seen from (16), for a fixed sample size m it follows that the larger n, the smaller d*. This is intuitive since the more partial information, the smaller the solution set N_n^{-1}(N_n(g)) and the lower the complexity of a hypothesis class needed to approximate it. Consequently, the optimal estimator ĥ belongs to a simpler hypothesis class and does not overfit the data as much.

We next compute the trade-off between n and m. Assuming d is fixed (not necessarily at the optimal value d*) and fixing the total available information and sample size, m + n, at some constant value while minimizing the upper bound on L(ĥ) over m and n, we obtain

m ≅ c_5 n^{(l+2r)/(2l)} ln n

for a constant c_5 > 0 which depends polynomially only on l and r. We conclude that when the dimensionality l of X is smaller than twice the smoothness parameter r, the sample size m grows polynomially in n at a rate no larger than n^{(1+r)/l}; i.e., partial information about the target g is worth approximately a polynomial number of examples. For l > 2r, n grows polynomially in m at a rate no larger than m^2/ln m; i.e., information obtained from examples is worth a polynomial amount of partial information.

We have focused so far on dealing with the ideal learning scenario in which the teacher has access to the optimal information operator N_n* and the optimal hypothesis class H_d^{y*}. The use of such optimally efficient information was required from an information theoretic point of view in order to calculate the trade-off between the sample complexity m and information cardinality n. But we have not specified the form of such optimal information and hypothesis class. In the next result we state a lower bound on the minimal partial information error I_{n,d}(F) and subsequently show that there exists an operator and a hypothesis class which almost achieve this lower bound.
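The m versus n trade-off can also be seen numerically. The sketch below is illustrative only: the constants and parameter values are arbitrary assumptions, and the bound is again the reconstruction of (15) above; it fixes a total budget m + n and searches for the split that minimizes the upper bound at a fixed d.

```python
import math

def error_bound(m, n, d, r, l, delta=0.05, c1=1.0, c2=1.0):
    """Upper bound (15), as in the previous sketch, with arbitrary c1 = c2 = 1."""
    learning = c1 * math.sqrt((d * math.log2(d) * math.log(m) + math.log(1 / delta)) / m)
    return learning + c2 / (n + d) ** (r / l)

d, r, l = 10, 1, 3
for budget in (1_000, 10_000, 100_000):
    # Split the fixed budget m + n between examples and information units.
    best_n = min(range(1, budget - 1), key=lambda n: error_bound(budget - n, n, d, r, l))
    print(f"budget m+n={budget:>6}: best n={best_n:>5}, m={budget - best_n}")
```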


More information

Some Examples. Uniform motion. Poisson processes on the real line

Some Examples. Uniform motion. Poisson processes on the real line Some Examples Our immeiate goal is to see some examples of Lévy processes, an/or infinitely-ivisible laws on. Uniform motion Choose an fix a nonranom an efine X := for all (1) Then, {X } is a [nonranom]

More information

Linear First-Order Equations

Linear First-Order Equations 5 Linear First-Orer Equations Linear first-orer ifferential equations make up another important class of ifferential equations that commonly arise in applications an are relatively easy to solve (in theory)

More information

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics This moule is part of the Memobust Hanbook on Methoology of Moern Business Statistics 26 March 2014 Metho: Balance Sampling for Multi-Way Stratification Contents General section... 3 1. Summary... 3 2.

More information

Calculus and optimization

Calculus and optimization Calculus an optimization These notes essentially correspon to mathematical appenix 2 in the text. 1 Functions of a single variable Now that we have e ne functions we turn our attention to calculus. A function

More information

Lower bounds on Locality Sensitive Hashing

Lower bounds on Locality Sensitive Hashing Lower bouns on Locality Sensitive Hashing Rajeev Motwani Assaf Naor Rina Panigrahy Abstract Given a metric space (X, X ), c 1, r > 0, an p, q [0, 1], a istribution over mappings H : X N is calle a (r,

More information

Convergence of Random Walks

Convergence of Random Walks Chapter 16 Convergence of Ranom Walks This lecture examines the convergence of ranom walks to the Wiener process. This is very important both physically an statistically, an illustrates the utility of

More information

Implicit Differentiation

Implicit Differentiation Implicit Differentiation Thus far, the functions we have been concerne with have been efine explicitly. A function is efine explicitly if the output is given irectly in terms of the input. For instance,

More information

Equilibrium in Queues Under Unknown Service Times and Service Value

Equilibrium in Queues Under Unknown Service Times and Service Value University of Pennsylvania ScholarlyCommons Finance Papers Wharton Faculty Research 1-2014 Equilibrium in Queues Uner Unknown Service Times an Service Value Laurens Debo Senthil K. Veeraraghavan University

More information

Logarithmic spurious regressions

Logarithmic spurious regressions Logarithmic spurious regressions Robert M. e Jong Michigan State University February 5, 22 Abstract Spurious regressions, i.e. regressions in which an integrate process is regresse on another integrate

More information

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments

Time-of-Arrival Estimation in Non-Line-Of-Sight Environments 2 Conference on Information Sciences an Systems, The Johns Hopkins University, March 2, 2 Time-of-Arrival Estimation in Non-Line-Of-Sight Environments Sinan Gezici, Hisashi Kobayashi an H. Vincent Poor

More information

Ramsey numbers of some bipartite graphs versus complete graphs

Ramsey numbers of some bipartite graphs versus complete graphs Ramsey numbers of some bipartite graphs versus complete graphs Tao Jiang, Michael Salerno Miami University, Oxfor, OH 45056, USA Abstract. The Ramsey number r(h, K n ) is the smallest positive integer

More information

Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions

Computing Exact Confidence Coefficients of Simultaneous Confidence Intervals for Multinomial Proportions and their Functions Working Paper 2013:5 Department of Statistics Computing Exact Confience Coefficients of Simultaneous Confience Intervals for Multinomial Proportions an their Functions Shaobo Jin Working Paper 2013:5

More information

PDE Notes, Lecture #11

PDE Notes, Lecture #11 PDE Notes, Lecture # from Professor Jalal Shatah s Lectures Febuary 9th, 2009 Sobolev Spaces Recall that for u L loc we can efine the weak erivative Du by Du, φ := udφ φ C0 If v L loc such that Du, φ =

More information

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k

Robust Forward Algorithms via PAC-Bayes and Laplace Distributions. ω Q. Pr (y(ω x) < 0) = Pr A k A Proof of Lemma 2 B Proof of Lemma 3 Proof: Since the support of LL istributions is R, two such istributions are equivalent absolutely continuous with respect to each other an the ivergence is well-efine

More information

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI

An Introduction to Statistical Theory of Learning. Nakul Verma Janelia, HHMI An Introduction to Statistical Theory of Learning Nakul Verma Janelia, HHMI Towards formalizing learning What does it mean to learn a concept? Gain knowledge or experience of the concept. The basic process

More information

Polynomial Inclusion Functions

Polynomial Inclusion Functions Polynomial Inclusion Functions E. e Weert, E. van Kampen, Q. P. Chu, an J. A. Muler Delft University of Technology, Faculty of Aerospace Engineering, Control an Simulation Division E.eWeert@TUDelft.nl

More information

Tutorial on Maximum Likelyhood Estimation: Parametric Density Estimation

Tutorial on Maximum Likelyhood Estimation: Parametric Density Estimation Tutorial on Maximum Likelyhoo Estimation: Parametric Density Estimation Suhir B Kylasa 03/13/2014 1 Motivation Suppose one wishes to etermine just how biase an unfair coin is. Call the probability of tossing

More information

A Sketch of Menshikov s Theorem

A Sketch of Menshikov s Theorem A Sketch of Menshikov s Theorem Thomas Bao March 14, 2010 Abstract Let Λ be an infinite, locally finite oriente multi-graph with C Λ finite an strongly connecte, an let p

More information

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes Leaving Ranomness to Nature: -Dimensional Prouct Coes through the lens of Generalize-LDPC coes Tavor Baharav, Kannan Ramchanran Dept. of Electrical Engineering an Computer Sciences, U.C. Berkeley {tavorb,

More information

Schrödinger s equation.

Schrödinger s equation. Physics 342 Lecture 5 Schröinger s Equation Lecture 5 Physics 342 Quantum Mechanics I Wenesay, February 3r, 2010 Toay we iscuss Schröinger s equation an show that it supports the basic interpretation of

More information

Discrete Mathematics

Discrete Mathematics Discrete Mathematics 309 (009) 86 869 Contents lists available at ScienceDirect Discrete Mathematics journal homepage: wwwelseviercom/locate/isc Profile vectors in the lattice of subspaces Dániel Gerbner

More information

1. Aufgabenblatt zur Vorlesung Probability Theory

1. Aufgabenblatt zur Vorlesung Probability Theory 24.10.17 1. Aufgabenblatt zur Vorlesung By (Ω, A, P ) we always enote the unerlying probability space, unless state otherwise. 1. Let r > 0, an efine f(x) = 1 [0, [ (x) exp( r x), x R. a) Show that p f

More information

arxiv: v2 [cs.ds] 11 May 2016

arxiv: v2 [cs.ds] 11 May 2016 Optimizing Star-Convex Functions Jasper C.H. Lee Paul Valiant arxiv:5.04466v2 [cs.ds] May 206 Department of Computer Science Brown University {jasperchlee,paul_valiant}@brown.eu May 3, 206 Abstract We

More information

On conditional moments of high-dimensional random vectors given lower-dimensional projections

On conditional moments of high-dimensional random vectors given lower-dimensional projections Submitte to the Bernoulli arxiv:1405.2183v2 [math.st] 6 Sep 2016 On conitional moments of high-imensional ranom vectors given lower-imensional projections LUKAS STEINBERGER an HANNES LEEB Department of

More information

Necessary and Sufficient Conditions for Sketched Subspace Clustering

Necessary and Sufficient Conditions for Sketched Subspace Clustering Necessary an Sufficient Conitions for Sketche Subspace Clustering Daniel Pimentel-Alarcón, Laura Balzano 2, Robert Nowak University of Wisconsin-Maison, 2 University of Michigan-Ann Arbor Abstract This

More information

05 The Continuum Limit and the Wave Equation

05 The Continuum Limit and the Wave Equation Utah State University DigitalCommons@USU Founations of Wave Phenomena Physics, Department of 1-1-2004 05 The Continuum Limit an the Wave Equation Charles G. Torre Department of Physics, Utah State University,

More information

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Ashish Goel Michael Kapralov Sanjeev Khanna Abstract We consier the well-stuie problem of fining a perfect matching in -regular bipartite

More information

Introduction to the Vlasov-Poisson system

Introduction to the Vlasov-Poisson system Introuction to the Vlasov-Poisson system Simone Calogero 1 The Vlasov equation Consier a particle with mass m > 0. Let x(t) R 3 enote the position of the particle at time t R an v(t) = ẋ(t) = x(t)/t its

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.

More information

arxiv: v2 [physics.data-an] 5 Jul 2012

arxiv: v2 [physics.data-an] 5 Jul 2012 Submitte to the Annals of Statistics OPTIMAL TAGET PLAE RECOVERY ROM OISY MAIOLD SAMPLES arxiv:.460v physics.ata-an] 5 Jul 0 By Daniel. Kaslovsky an rançois G. Meyer University of Colorao, Bouler Constructing

More information

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors

Math Notes on differentials, the Chain Rule, gradients, directional derivative, and normal vectors Math 18.02 Notes on ifferentials, the Chain Rule, graients, irectional erivative, an normal vectors Tangent plane an linear approximation We efine the partial erivatives of f( xy, ) as follows: f f( x+

More information

Table of Common Derivatives By David Abraham

Table of Common Derivatives By David Abraham Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec

More information

Bayesian Estimation of the Entropy of the Multivariate Gaussian

Bayesian Estimation of the Entropy of the Multivariate Gaussian Bayesian Estimation of the Entropy of the Multivariate Gaussian Santosh Srivastava Fre Hutchinson Cancer Research Center Seattle, WA 989, USA Email: ssrivast@fhcrc.org Maya R. Gupta Department of Electrical

More information

Iterated Point-Line Configurations Grow Doubly-Exponentially

Iterated Point-Line Configurations Grow Doubly-Exponentially Iterate Point-Line Configurations Grow Doubly-Exponentially Joshua Cooper an Mark Walters July 9, 008 Abstract Begin with a set of four points in the real plane in general position. A to this collection

More information

Slide10 Haykin Chapter 14: Neurodynamics (3rd Ed. Chapter 13)

Slide10 Haykin Chapter 14: Neurodynamics (3rd Ed. Chapter 13) Slie10 Haykin Chapter 14: Neuroynamics (3r E. Chapter 13) CPSC 636-600 Instructor: Yoonsuck Choe Spring 2012 Neural Networks with Temporal Behavior Inclusion of feeback gives temporal characteristics to

More information

IPA Derivatives for Make-to-Stock Production-Inventory Systems With Backorders Under the (R,r) Policy

IPA Derivatives for Make-to-Stock Production-Inventory Systems With Backorders Under the (R,r) Policy IPA Derivatives for Make-to-Stock Prouction-Inventory Systems With Backorers Uner the (Rr) Policy Yihong Fan a Benamin Melame b Yao Zhao c Yorai Wari Abstract This paper aresses Infinitesimal Perturbation

More information

Expected Value of Partial Perfect Information

Expected Value of Partial Perfect Information Expecte Value of Partial Perfect Information Mike Giles 1, Takashi Goa 2, Howar Thom 3 Wei Fang 1, Zhenru Wang 1 1 Mathematical Institute, University of Oxfor 2 School of Engineering, University of Tokyo

More information

Linear Regression with Limited Observation

Linear Regression with Limited Observation Ela Hazan Tomer Koren Technion Israel Institute of Technology, Technion City 32000, Haifa, Israel ehazan@ie.technion.ac.il tomerk@cs.technion.ac.il Abstract We consier the most common variants of linear

More information

Closed and Open Loop Optimal Control of Buffer and Energy of a Wireless Device

Closed and Open Loop Optimal Control of Buffer and Energy of a Wireless Device Close an Open Loop Optimal Control of Buffer an Energy of a Wireless Device V. S. Borkar School of Technology an Computer Science TIFR, umbai, Inia. borkar@tifr.res.in A. A. Kherani B. J. Prabhu INRIA

More information

Final Exam Study Guide and Practice Problems Solutions

Final Exam Study Guide and Practice Problems Solutions Final Exam Stuy Guie an Practice Problems Solutions Note: These problems are just some of the types of problems that might appear on the exam. However, to fully prepare for the exam, in aition to making

More information

Resistant Polynomials and Stronger Lower Bounds for Depth-Three Arithmetical Formulas

Resistant Polynomials and Stronger Lower Bounds for Depth-Three Arithmetical Formulas Resistant Polynomials an Stronger Lower Bouns for Depth-Three Arithmetical Formulas Maurice J. Jansen University at Buffalo Kenneth W.Regan University at Buffalo Abstract We erive quaratic lower bouns

More information

Separation of Variables

Separation of Variables Physics 342 Lecture 1 Separation of Variables Lecture 1 Physics 342 Quantum Mechanics I Monay, January 25th, 2010 There are three basic mathematical tools we nee, an then we can begin working on the physical

More information

An extension of Alexandrov s theorem on second derivatives of convex functions

An extension of Alexandrov s theorem on second derivatives of convex functions Avances in Mathematics 228 (211 2258 2267 www.elsevier.com/locate/aim An extension of Alexanrov s theorem on secon erivatives of convex functions Joseph H.G. Fu 1 Department of Mathematics, University

More information

Robustness and Perturbations of Minimal Bases

Robustness and Perturbations of Minimal Bases Robustness an Perturbations of Minimal Bases Paul Van Dooren an Froilán M Dopico December 9, 2016 Abstract Polynomial minimal bases of rational vector subspaces are a classical concept that plays an important

More information

Introduction to Machine Learning

Introduction to Machine Learning How o you estimate p(y x)? Outline Contents Introuction to Machine Learning Logistic Regression Varun Chanola April 9, 207 Generative vs. Discriminative Classifiers 2 Logistic Regression 2 3 Logistic Regression

More information

Agmon Kolmogorov Inequalities on l 2 (Z d )

Agmon Kolmogorov Inequalities on l 2 (Z d ) Journal of Mathematics Research; Vol. 6, No. ; 04 ISSN 96-9795 E-ISSN 96-9809 Publishe by Canaian Center of Science an Eucation Agmon Kolmogorov Inequalities on l (Z ) Arman Sahovic Mathematics Department,

More information

How to Minimize Maximum Regret in Repeated Decision-Making

How to Minimize Maximum Regret in Repeated Decision-Making How to Minimize Maximum Regret in Repeate Decision-Making Karl H. Schlag July 3 2003 Economics Department, European University Institute, Via ella Piazzuola 43, 033 Florence, Italy, Tel: 0039-0-4689, email:

More information

Characterizing Real-Valued Multivariate Complex Polynomials and Their Symmetric Tensor Representations

Characterizing Real-Valued Multivariate Complex Polynomials and Their Symmetric Tensor Representations Characterizing Real-Value Multivariate Complex Polynomials an Their Symmetric Tensor Representations Bo JIANG Zhening LI Shuzhong ZHANG December 31, 2014 Abstract In this paper we stuy multivariate polynomial

More information

PLAL: Cluster-based Active Learning

PLAL: Cluster-based Active Learning JMLR: Workshop an Conference Proceeings vol 3 (13) 1 22 PLAL: Cluster-base Active Learning Ruth Urner rurner@cs.uwaterloo.ca School of Computer Science, University of Waterloo, Canaa, ON, N2L 3G1 Sharon

More information

Jointly continuous distributions and the multivariate Normal

Jointly continuous distributions and the multivariate Normal Jointly continuous istributions an the multivariate Normal Márton alázs an álint Tóth October 3, 04 This little write-up is part of important founations of probability that were left out of the unit Probability

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

II. First variation of functionals

II. First variation of functionals II. First variation of functionals The erivative of a function being zero is a necessary conition for the etremum of that function in orinary calculus. Let us now tackle the question of the equivalent

More information

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems

Construction of the Electronic Radial Wave Functions and Probability Distributions of Hydrogen-like Systems Construction of the Electronic Raial Wave Functions an Probability Distributions of Hyrogen-like Systems Thomas S. Kuntzleman, Department of Chemistry Spring Arbor University, Spring Arbor MI 498 tkuntzle@arbor.eu

More information

Computing Derivatives

Computing Derivatives Chapter 2 Computing Derivatives 2.1 Elementary erivative rules Motivating Questions In this section, we strive to unerstan the ieas generate by the following important questions: What are alternate notations

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21

'HVLJQ &RQVLGHUDWLRQ LQ 0DWHULDO 6HOHFWLRQ 'HVLJQ 6HQVLWLYLW\,1752'8&7,21 Large amping in a structural material may be either esirable or unesirable, epening on the engineering application at han. For example, amping is a esirable property to the esigner concerne with limiting

More information

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS Yannick DEVILLE Université Paul Sabatier Laboratoire Acoustique, Métrologie, Instrumentation Bât. 3RB2, 8 Route e Narbonne,

More information

WUCHEN LI AND STANLEY OSHER

WUCHEN LI AND STANLEY OSHER CONSTRAINED DYNAMICAL OPTIMAL TRANSPORT AND ITS LAGRANGIAN FORMULATION WUCHEN LI AND STANLEY OSHER Abstract. We propose ynamical optimal transport (OT) problems constraine in a parameterize probability

More information

Node Density and Delay in Large-Scale Wireless Networks with Unreliable Links

Node Density and Delay in Large-Scale Wireless Networks with Unreliable Links Noe Density an Delay in Large-Scale Wireless Networks with Unreliable Links Shizhen Zhao, Xinbing Wang Department of Electronic Engineering Shanghai Jiao Tong University, China Email: {shizhenzhao,xwang}@sjtu.eu.cn

More information

arxiv: v4 [math.pr] 27 Jul 2016

arxiv: v4 [math.pr] 27 Jul 2016 The Asymptotic Distribution of the Determinant of a Ranom Correlation Matrix arxiv:309768v4 mathpr] 7 Jul 06 AM Hanea a, & GF Nane b a Centre of xcellence for Biosecurity Risk Analysis, University of Melbourne,

More information

On combinatorial approaches to compressed sensing

On combinatorial approaches to compressed sensing On combinatorial approaches to compresse sensing Abolreza Abolhosseini Moghaam an Hayer Raha Department of Electrical an Computer Engineering, Michigan State University, East Lansing, MI, U.S. Emails:{abolhos,raha}@msu.eu

More information

Generalization of the persistent random walk to dimensions greater than 1

Generalization of the persistent random walk to dimensions greater than 1 PHYSICAL REVIEW E VOLUME 58, NUMBER 6 DECEMBER 1998 Generalization of the persistent ranom walk to imensions greater than 1 Marián Boguñá, Josep M. Porrà, an Jaume Masoliver Departament e Física Fonamental,

More information

Math 342 Partial Differential Equations «Viktor Grigoryan

Math 342 Partial Differential Equations «Viktor Grigoryan Math 342 Partial Differential Equations «Viktor Grigoryan 6 Wave equation: solution In this lecture we will solve the wave equation on the entire real line x R. This correspons to a string of infinite

More information

On colour-blind distinguishing colour pallets in regular graphs

On colour-blind distinguishing colour pallets in regular graphs J Comb Optim (2014 28:348 357 DOI 10.1007/s10878-012-9556-x On colour-blin istinguishing colour pallets in regular graphs Jakub Przybyło Publishe online: 25 October 2012 The Author(s 2012. This article

More information

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction

FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS. 1. Introduction FLUCTUATIONS IN THE NUMBER OF POINTS ON SMOOTH PLANE CURVES OVER FINITE FIELDS ALINA BUCUR, CHANTAL DAVID, BROOKE FEIGON, MATILDE LALÍN 1 Introuction In this note, we stuy the fluctuations in the number

More information

A New Minimum Description Length

A New Minimum Description Length A New Minimum Description Length Soosan Beheshti, Munther A. Dahleh Laboratory for Information an Decision Systems Massachusetts Institute of Technology soosan@mit.eu,ahleh@lis.mit.eu Abstract The minimum

More information

Algorithms and matching lower bounds for approximately-convex optimization

Algorithms and matching lower bounds for approximately-convex optimization Algorithms an matching lower bouns for approximately-convex optimization Yuanzhi Li Department of Computer Science Princeton University Princeton, NJ, 08450 yuanzhil@cs.princeton.eu Anrej Risteski Department

More information

Gaussian processes with monotonicity information

Gaussian processes with monotonicity information Gaussian processes with monotonicity information Anonymous Author Anonymous Author Unknown Institution Unknown Institution Abstract A metho for using monotonicity information in multivariate Gaussian process

More information

Modelling and simulation of dependence structures in nonlife insurance with Bernstein copulas

Modelling and simulation of dependence structures in nonlife insurance with Bernstein copulas Moelling an simulation of epenence structures in nonlife insurance with Bernstein copulas Prof. Dr. Dietmar Pfeifer Dept. of Mathematics, University of Olenburg an AON Benfiel, Hamburg Dr. Doreen Straßburger

More information

A Review of Multiple Try MCMC algorithms for Signal Processing

A Review of Multiple Try MCMC algorithms for Signal Processing A Review of Multiple Try MCMC algorithms for Signal Processing Luca Martino Image Processing Lab., Universitat e València (Spain) Universia Carlos III e Mari, Leganes (Spain) Abstract Many applications

More information

CS 6375: Machine Learning Computational Learning Theory

CS 6375: Machine Learning Computational Learning Theory CS 6375: Machine Learning Computational Learning Theory Vibhav Gogate The University of Texas at Dallas Many slides borrowed from Ray Mooney 1 Learning Theory Theoretical characterizations of Difficulty

More information

arxiv: v1 [hep-lat] 19 Nov 2013

arxiv: v1 [hep-lat] 19 Nov 2013 HU-EP-13/69 SFB/CPP-13-98 DESY 13-225 Applicability of Quasi-Monte Carlo for lattice systems arxiv:1311.4726v1 [hep-lat] 19 ov 2013, a,b Tobias Hartung, c Karl Jansen, b Hernan Leovey, Anreas Griewank

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information