Narrowing confidence interval width of PAC learning risk function by algorithmic inference


Bruno Apolloni, Dario Malchiodi
Dip. di Scienze dell'Informazione, Università degli Studi di Milano
Via Comelico 39/41, 20135 Milano, Italy
E-mail: {apolloni, malchiodi}@dsi.unimi.it

Abstract

We narrow the width of the confidence interval introduced by Vapnik and Chervonenkis for the risk function in PAC learning boolean functions through non-consistent hypotheses. To obtain this improvement for a large class of learning algorithms we introduce both a theoretical framework for the statistical inference of functions and a concept class complexity index, the detail, that is dual to the Vapnik-Chervonenkis dimension. The detail of a class and the maximum number of mislabelled points add up linearly to constitute the learning problem complexity. The dependency of the sample complexity on this index is almost the same as the one on the VC dimension. We formally prove that the former leads to confidence intervals for the risk function that are definitely narrower than in the latter case.

1 Introduction

A suitable way of revisiting PAC learning is to assume that probabilities are random variables per se. Right from the start, the object of our inference is a string of data (possibly of infinite length) that we partition into a prefix we assume to be known at present (and therefore call

sample) and a suffix of unknown future data we call a population (see Figure 1). All these data share the feature of being independent observations of the same phenomenon. Therefore, without loss of generality, we assume these data to be the output of some function g having as input a set of independent random variables U uniformly distributed in the unit interval (effectively, the most essential source of randomness). We will refer to M = (U, g) as a sampling mechanism and to g as an explaining function; such a g always exists by the probability integral transformation theorem [5]. This function is precisely the object of our inference. (By default, capital letters such as U, X will denote random variables and small letters u, x their corresponding realizations; the sets the realizations belong to will be denoted by capital gothic letters.)

Figure 1: Sample and population of random bits.

Let us consider, for instance, the sampling mechanism M = (U, g_p), where U is the above uniform random variable and g_p(u) = 1 if u <= p and 0 otherwise, which describes the sample and population from a Bernoulli random variable of mean p, as in Figure 1. As can be seen from Figure 2, for a given sequence of U's we obtain different binary strings depending on the height of the threshold line. Thus it is easy to derive the following implication chain

(K_{p'} ≥ k) ⇐ (p' ≥ P) ⇐ (K_{p'} ≥ k + 1)     (1)

(where k is the number of 1s observed in the sample, and K_{p'} denotes the random variable counting the number of 1s in the sample if the threshold in the explaining function switches to p' for the same realizations of U) and the consequent bound on the probability

P(K_{p'} ≥ k) ≥ P(p' ≥ P) ≥ P(K_{p'} ≥ k + 1)     (2)

which characterizes the cumulative distribution function (c.d.f.) F_P of the parameter P.
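As a minimal illustration of the sampling mechanism and of the bound (2), the following Python sketch simulates M = (U, g_p), checks the implication chain (1) on a grid of candidate thresholds, and evaluates the two binomial tails bounding the c.d.f. of P. It is a sketch only: the variable and function names and the numerical values (m = 20, p = 0.3, the threshold grid) are our own choices for the illustration, not material from the paper.

# Illustrative sketch of the Bernoulli sampling mechanism M = (U, g_p)
# and of the bound (2) on the c.d.f. of the parameter P.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
m = 20                        # observed sample size (illustrative)
u = rng.uniform(size=m)       # realizations of the uniform seeds U_i
p_true = 0.3                  # p_true plays the role of the (unknown) parameter P
k = int(np.sum(u <= p_true))  # number of 1s in the observed sample

def count_ones(p_prime):
    """K_{p'}: number of 1s obtained by switching the threshold to p', same U's."""
    return int(np.sum(u <= p_prime))

# Implication chain (1): raising the threshold never decreases the count of 1s,
# and a count larger than k certifies that the threshold was raised.
for p_prime in np.linspace(0, 1, 11):
    assert (p_prime >= p_true) <= (count_ones(p_prime) >= k)      # right implication
    assert (count_ones(p_prime) >= k + 1) <= (p_prime >= p_true)  # left implication

# Bound (2): P(K_{p'} >= k+1) <= F_P(p') <= P(K_{p'} >= k), with K_{p'} ~ Bin(m, p').
p_prime = 0.4
lower = binom.sf(k, m, p_prime)        # P(K_{p'} >= k+1)
upper = binom.sf(k - 1, m, p_prime)    # P(K_{p'} >= k)
print(f"{lower:.3f} <= F_P({p_prime}) <= {upper:.3f}")

Both tails are ordinary Binomial survival probabilities, which is what makes the distribution of the parameter directly computable from the observed count k.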

In our statistical framework, indeed, the above height P is a random variable in [0, 1] representing the asymptotic (M → ∞ in Figure 2) frequency of 1 in the populations that are compatible, as a function of the U suffix of the sample (of size m in the figure), with the number k of actually observed 1s. Equation (2) comes straight from marginalizing the joint distribution of the U's with respect to the population when we deal with the sample statistic K_{p'}, and vice versa when we deal with the population parameter P. Note the asymmetry in the implications. It derives from the fact that: (i) raising the threshold parameter in g_{p'} cannot decrease the number of 1s in the observed sample, but (ii) we can recognize that such a raising occurred only if we really see a number of ones in the sample greater than k.

Figure 2: Generating a Bernoulli sample. Horizontal axis: index of the U realizations; vertical axis: both the u (lines) and x (bullets) values. The threshold line of height p realizes a mapping from U to X through g_p.

We will refer to every expression similar to (1) as a twisting argument, since it allows us to exchange events on parameters with events on statistics. Twisting sample with population properties is our approach, which we call algorithmic inference, to statistical inference.

The peculiarity of PAC learning boolean functions is that we need more than one sampled point to recognize that the probability measure of the error domain is less than a given ε, in analogy with the right implication in (1). In Section 2 we will denote these points as sentry points of a concept and will derive bounds on the distribution of the above measure. These bounds are based on the supremum of the cardinality of the sentry points over the set of symmetric differences between concepts and candidate hypotheses, which we call detail, and on an analogous limit on the number of points misclassified by the hypotheses. In Section 3 we draw the consequent confidence intervals for the error probability and compare them with the analogous, commonly used intervals proposed by Vapnik and Chervonenkis [6]. A set of graphs shows the great gain in interval width achieved by our approach, especially when the sample size is modest.

2 Algorithmic inference of Boolean functions

In the typical framework of PAC learning theory the parameter to be investigated is the probability that the inferred function will compute erroneously on the next inputs (will not explain new sampled points). In greater detail, the general form of the sample is

z_m = {(x_i, b_i), i = 1, ..., m},

where the b_i are boolean variables. If we assume that for every M and every z_M a c exists in a Boolean concept class, call it C, such that c(x_i) = b_i for i = 1, ..., M, then we are interested in the measure of the symmetric difference between an h (that we denote hypothesis) computed from z_m by a function A (that we call learning algorithm) and any such c (see Figure 3). Note that in our approach, for a given sample z_m, we consider the possible suffixes covered by c.

Figure 3: Circle describing the sample and possible circles describing the population. Small rhombi and circles: sampled points; line-filled region: symmetric difference.

The peculiarity of this inference problem is that some degrees of freedom of our sample are burned by the property whose confidence interval we are looking for. Namely, if we denote by the random variable U_{c⊖h} the measure of the above symmetric difference, the twisting argument reads (with a caveat on the left part):

(T_ε ≥ t_U + 1) ⇐ (ε ≥ U_{c⊖h}) ⇐ (T_ε ≥ t_U + D_{C⊖h})     (3)

where t_U is the number of actual sample points falling in c ⊖ h (the empirical risk in the Vapnik notation [6]), T_ε is the analogous statistic for an enlargement of c ⊖ h of measure ε, and D_{C⊖h} is a new complexity measure directly referred to the class of symmetric differences. The threshold in the left inequality is due to the fact that h is in its own turn a function of a sample specification z_m, so that, if A is such that the symmetric difference grows with the set of included sample points and vice versa (the mentioned caveat), (ε ≥ U_{c⊖h}) implies that any enlargement region containing c ⊖ h must violate consistency on at least one more of the sampled points at the basis of h's computation. The quantity D_{C⊖h} is an upper bound to the number of sample points sufficient to witness that, according to A, a new hypothesis containing c ⊖ h has been generated after an increase of U_{c⊖h}. These points, which we figure as the concepts' sentinels, are formally described as follows.

Definition 2.1. For a concept class C on a space X (the set of specifications of the sampled variable), a sentry function is a total function S: C → 2^X satisfying the conditions:

(i) S(c) ∩ c = ∅ for all c ∈ C (sentinels are outside the sentinelled concept);

(ii) having introduced the sets c⁺ = c ∪ S(c) and up(c) = {c′ ∈ C such that c′ ⊇ c and c′ ⊈ c⁺}, if c_2 ∈ up(c_1) then c_2 ∩ S(c_1) ≠ ∅ (sentinels are inside the invading concept);

(iii) no S′ ≠ S exists satisfying (i) and (ii) and having the property that S′(c) ⊆ S(c) for every c ∈ C (we look for a minimal set of sentinels);

(iv) whenever c_1 and c_2 are such that c_1 ⊆ c_2 ∪ S(c_2) and c_2 ∩ S(c_1) ≠ ∅, the restriction of S to {c_1} ∪ up(c_1) \ {c_2} is a sentry function on this set (sentinels are honest watchers).

S(c) is the frontier of c upon S, and its elements are called sentry points. The quantity D_c = sup_S #S(c) is called the detail of c. For a hypothesis h, denoting C ⊖ h the set {c ⊖ h, c ∈ C}, the detail D_{C⊖h} of this class is the quantity sup_{c∈C} D_{c⊖h}.

A given concept class might admit more than one sentry function. Condition (iv) prevents us

from building sentry functions which are unnatural, where some frontier points of a concept c_1 have the sole role of artificially increasing the elements of c_1 ∪ S(c_1) in order to prevent it from being included in another c_2. The mentioned condition states that this role can be considered only a side effect of points which are primarily involved in sentinelling some formula. Extension of the domain of S in order to include the empty concept is necessary to state some key properties, such as Fact 2.1 below.

Example 2.1. Let us consider a class C_2 = {c_1, ..., c_5} of boolean formulas on {0, 1}^2 in terms of the propositional variables x_1 and x_2. After labelling the points inside and outside each concept, the related supports can be represented in a table over the points 00, 01, 10, 11. By inspection of points (i)-(iv) of Definition 2.1, a possible outer sentry function for C_2 is: S(c_1) = {11}, S(c_2) = {01, 10}, S(c_3) = {01}, S(c_4) = {10}, S(c_5) = ∅. The reader can realize that S(c_1) = {00, 01} is unfeasible according to condition (iv). Indeed, the point 00 has the sole task of removing c_2 and c_3 from up(c_1), but it is useless for sentinelling, in conjunction with 01, the concepts belonging to up(c_1) \ {c_2, c_3}.

The detail is a parameter difficult to compute, except in the case of some Boolean classes usually referred to in the literature: for instance, D_C is 1 if the elements of C are oriented half-lines on ℝ, 2 if they are segments or circles on ℝ², and s if they are convex polygons on ℝ² having exactly s edges. The interested reader can find detailed examples in [1] and [2].
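As a rough computational counterpart of the notion of sentry points, the following sketch enforces only the plainest reading of conditions (i)-(iii), namely that sentinels lie outside a concept and that every concept of the class strictly containing it must include at least one of them; it ignores the up(·) machinery of condition (ii) and condition (iv) altogether. The small class used here, and all names in the code, are our own choices for illustration and are not taken from Example 2.1.

# Naive brute-force sketch (illustration only): for each concept c, find a
# minimum-cardinality set of points outside c that intersects every concept
# of the class strictly containing c. This implements only the plain reading
# of conditions (i)-(iii) of Definition 2.1, not the up(.)/condition (iv) machinery.
from itertools import combinations

X = {"00", "01", "10", "11"}
# Illustrative class over {0,1}^2: empty concept, x1 AND x2, x1, x2, whole space.
C = [frozenset(), frozenset({"11"}), frozenset({"10", "11"}),
     frozenset({"01", "11"}), frozenset(X)]

def naive_sentinels(c, concepts, points):
    invaders = [cp for cp in concepts if cp > c]          # strict supersets of c
    outside = sorted(points - c)
    for size in range(len(outside) + 1):
        for cand in combinations(outside, size):
            if all(set(cand) & cp for cp in invaders):    # every invader is hit
                return set(cand)
    return set()

for c in C:
    print(sorted(c), "->", sorted(naive_sentinels(c, C, X)))
# The largest cardinality found this way gives a first, naive approximation
# of the class detail for this toy class.

On such toy classes the brute-force search is immediate; the computational difficulty mentioned above shows up as soon as the class and the point space grow.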

However, although semantically different from the Vapnik-Chervonenkis dimension, this complexity index is related to the latter by the following:

Fact 2.1. [1] Let us denote by V(C) the Vapnik-Chervonenkis dimension [4] of a concept class C. Then D_C and V(C) bound each other through linear inequalities (see [1] for the exact statement).

Substituting the new complexity index in (3), we find bounds on the sampling complexity for a wide class of learning algorithms according to the following theorem:

Theorem 2.1. [3] For a given probability space (X, F, P), where F is a σ-algebra on X and P is a possibly unknown probability measure defined over F, assume we are given:

1. a concept class C on X, with D_{C⊖h} denoting the detail of the related class of symmetric differences;

2. a sample z_m drawn from the fixed space and labelled according to a c ∈ C labelling an infinite suffix of it;

3. a function A mapping z_m into C and misclassifying at least t_0 and at most t_1 ∈ ℕ points, of total probability not greater than a fixed value in [0, 1).

Let us denote h = A(z_m) and by U_{c⊖h} the random variable representing the probability measure of c ⊖ h for any c ∈ C labelling z_m as in 2. Then for each ε ∈ (0, 1)

I_ε(1 + t_0, m - t_0) ≥ P(U_{c⊖h} ≤ ε) ≥ I_ε(D_{C⊖h} + t_1, m - (D_{C⊖h} + t_1) + 1)     (4)

where I_x(a, b) = ∫_0^x t^(a-1) (1 - t)^(b-1) dt / ∫_0^1 t^(a-1) (1 - t)^(b-1) dt is the incomplete Beta function.
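Since the two bounds in (4) are incomplete Beta functions, i.e. Beta cumulative distribution functions, they can be evaluated directly. The following minimal sketch does so; the numerical values of m, t_0, t_1 and of the detail, as well as the variable names, are arbitrary illustrative choices.

# Minimal sketch: evaluating the two incomplete-Beta bounds of (4) on the
# c.d.f. of the risk of the symmetric difference. Values are illustrative only.
from scipy.stats import beta

m   = 100   # sample size
t0  = 3     # minimum number of misclassified points
t1  = 5     # maximum number of misclassified points
D   = 4     # detail of the class of symmetric differences
eps = 0.15

upper = beta.cdf(eps, 1 + t0, m - t0)            # I_eps(1 + t0, m - t0)
lower = beta.cdf(eps, D + t1, m - (D + t1) + 1)  # I_eps(D + t1, m - (D + t1) + 1)
print(f"{lower:.3f} <= P(U <= {eps}) <= {upper:.3f}")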

3 Confidence intervals for the learning error

Inequalities (4) fix a lower and an upper bound for the cumulative distribution function of U_{c⊖h}, allowing us to state the following result:

Theorem 3.1. For a probability space (X, F, P), a concept class C, a learning algorithm A and parameters m, D_{C⊖h}, t_0 and t_1 as in Theorem 2.1, the event

l ≤ U_{c⊖h} ≤ u     (5)

has probability greater than 1 - δ, where l is the δ/2 quantile of the Beta distribution with parameters 1 + t_0 and m - t_0, and u is the analogous 1 - δ/2 quantile for parameters D_{C⊖h} + t_1 and m - (D_{C⊖h} + t_1) + 1.

Proof. Starting from (4) and getting I_u(D_{C⊖h} + t_1, m - (D_{C⊖h} + t_1) + 1) - I_l(t_0 + 1, m - t_0) as a lower bound to F_U(u) - F_U(l) = P(l ≤ U_{c⊖h} ≤ u), we can obtain the interval (l, u) by dividing the probability measure outside it into two equal parts, that is by solving the equation system

I_u(D_{C⊖h} + t_1, m - (D_{C⊖h} + t_1) + 1) = 1 - δ/2
I_l(t_0 + 1, m - t_0) = δ/2     (6)

with respect to l and u. Hence the claim follows.

In true statistical inference notation, the statement P(l ≤ U_{c⊖h} ≤ u) ≥ 1 - δ means that l and u are the extremes of a confidence interval at level δ for the learning error U_{c⊖h}. When m grows and the numerical solutions of (6) become difficult to handle, the Binomial distribution underlying them can be approximated with a Gaussian law, following the De Moivre-Laplace theorem [7]. The commonly used confidence interval for the same probability comes from the following theorem:

Theorem 3.2. [6] Let C = {c_α, α ∈ Λ} be a Boolean concept class of bounded Vapnik-Chervonenkis dimension V, and let ν(α) be the frequency of errors computed from the sample for a concept c_α ∈ C, R(α) being its actual risk. Then, for a sample of size m and simultaneously for all the concepts in C, the event

ν(α) - 2 √((V (log(2m/V) + 1) - log(δ/4)) / m) ≤ R(α) ≤ ν(α) + 2 √((V (log(2m/V) + 1) - log(δ/4)) / m)     (7)

has probability 1 - δ.
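To put the two theorems side by side numerically, the following sketch solves the system (6) through Beta quantiles and evaluates the interval width prescribed by (7). The sample size, error count, detail, confidence level and all helper names are illustrative choices of ours, with t_0 = t_1 and V(C) set equal to the detail, in the same spirit as the comparison described below.

# Minimal sketch comparing the interval of Theorem 3.1 (Beta quantiles solving
# system (6)) with the Vapnik-Chervonenkis interval (7). Values are illustrative,
# in the spirit of Figures 4 and 5.
import numpy as np
from scipy.stats import beta

def algorithmic_inference_interval(m, t0, t1, D, delta):
    lo = beta.ppf(delta / 2, t0 + 1, m - t0)                # I_l(t0+1, m-t0) = delta/2
    up = beta.ppf(1 - delta / 2, D + t1, m - (D + t1) + 1)  # I_u(D+t1, ...) = 1-delta/2
    return lo, up

def vapnik_interval(m, t, V, delta):
    nu = t / m                                              # empirical risk
    eps = 2 * np.sqrt((V * (np.log(2 * m / V) + 1) - np.log(delta / 4)) / m)
    return nu - eps, nu + eps                               # may fall outside [0, 1]

m, t, D, delta = 1000, 40, 4, 0.10      # t0 = t1 = t, V(C) = D, as in the comparison
print(algorithmic_inference_interval(m, t, t, D, delta))
print(vapnik_interval(m, t, D, delta))

With these illustrative numbers the Beta-quantile interval is markedly narrower than the Vapnik-Chervonenkis one and stays inside [0, 1], in line with the behaviour reported in Figures 4 and 5.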

Note that Theorem 2.1 can be stated in terms of the number of mislabelled points and of the frontier cardinality of the single learning task (essentially maintaining the same proof), as Theorem 3.1 does for the former quantity alone. For simplicity's sake, in the following numerical example we refer to upper bounds (or to values constant with concepts and hypotheses) on both the complexity indices and the empirical error.

Figure 4 compares our confidence intervals with the ones obtained by computing formula (7) for a set of values of the number of mislabelled points and of the complexity indices. Following the previous remark, we compute the former quantity in the case of the Vapnik formula as mν, where ν is the empirical risk (here constant with α). In the same spirit we assume V(C) = D_{C⊖h} and t_0 = t_1, treated as a continuous variable. For samples of 100, 1000 and 1000000 elements respectively, the three graphs show the limits of the bilateral confidence intervals drawn using both the Vapnik (external surfaces) and our (internal surfaces) bounds.

Figure 4: Comparison between bilateral confidence intervals for the actual risk. x-axis: number of misclassified points; y-axis: Vapnik-Chervonenkis dimension and class detail; z-axis: confidence interval limits. Light surfaces: Vapnik-Chervonenkis confidence intervals. Dark surfaces: our confidence intervals. (a) Sample size m = 100, (b) sample size m = 1000, (c) sample size m = 1000000.

Moreover, to appreciate the differences even better, in Figure 5 we draw a section at concept complexity 4 as a function of the number of misclassified points. We used dark gray lines for plotting the bounds from (6) and light gray lines for those from their Gaussian approximation. Note that these different bounds are distinguishable only in the first figure. The figure shows that: (i) our confidence intervals are always more accurate than Vapnik's; this benefit accounts for a narrowing of one order of magnitude at the smallest sample size, while it tends to disappear when the sample size increases; and (ii) our

confidence intervals are consistent, that is they are always contained in [0, 1].

Figure 5: Same comparison as in Figure 4 for class complexity = 4. x-axis: number of misclassified points; y-axis: confidence interval limits. Black lines: Vapnik-Chervonenkis confidence intervals. Dark gray lines: our confidence intervals. Light gray lines: our confidence intervals obtained using the Gaussian approximation.

References

[1] APOLLONI, B., AND CHIARAVALLI, S. PAC learning of concept classes through the boundaries of their items. Theoretical Computer Science 172 (1997), 91-120.

[2] APOLLONI, B., AND MALCHIODI, D. Gaining degrees of freedom in subsymbolic learning. Theoretical Computer Science 255 (2001), 295-321.

[3] APOLLONI, B., MALCHIODI, D., OROVAS, C., AND PALMAS, G. From synapses to rules. Cognitive Systems Research. In press.

[4] BLUMER, A., EHRENFEUCHT, A., HAUSSLER, D., AND WARMUTH, M. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM 36 (1989), 929-965.

[5] ROHATGI, V. K. An Introduction to Probability Theory and Mathematical Statistics. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York, 1976.

[6] VAPNIK, V. Estimation of Dependences Based on Empirical Data. Springer, New York, 1982.

[7] WILKS, S. S. Mathematical Statistics. Wiley Publications in Statistics. John Wiley, New York, 1965.