Narrowing confidence interval width of PAC learning risk function by algorithmic inference


Bruno Apolloni, Dario Malchiodi
Dip. di Scienze dell'Informazione, Università degli Studi di Milano
Via Comelico 39/41, 20135 Milano, Italy
E-mail: {apolloni, malchiodi}@dsi.unimi.it

Abstract

We narrow the width of the confidence interval introduced by Vapnik and Chervonenkis for the risk function in PAC learning boolean functions through non-consistent hypotheses. To obtain this improvement for a large class of learning algorithms we introduce both a theoretical framework for the statistical inference of functions and a concept class complexity index, the detail, that is dual to the Vapnik-Chervonenkis dimension. The detail of a class and the maximum number of mislabelled points add up linearly to constitute the learning problem complexity. The dependency of the sample complexity on this index is almost the same as the one on the VC dimension. We formally prove that the former leads to confidence intervals for the risk function that are definitely narrower than in the latter case.

1 Introduction

A suitable way of revisiting PAC learning is to assume that probabilities are random variables per se. Right from the start, the object of our inference is a string of data (possibly of infinite length) that we partition into a prefix we assume to be known at present (and therefore call

sample) and a suffix of unknown future data we call a population (see Figure 1). All these data share the feature of being independent observations of the same phenomenon. Therefore, without loss of generality, we assume these data to be the output of some function g having as input a set of independent random variables U uniformly distributed in the unit interval (effectively, the most essential source of randomness). We will refer to M = (U, g) as a sampling mechanism and to g as an explaining function; such a g always exists by the probability integral transformation theorem [5]. This function is precisely the object of our inference. (By default, capital letters such as U, X will denote random variables and small letters u, x their corresponding realizations; the sets the realizations belong to will be denoted by capital gothic letters.)

Figure 1: Sample and population of random bits.

Let us consider, for instance, the sampling mechanism M = (U, g_p), where U is the above uniform random variable and g_p(u) = 1 if u <= p and 0 otherwise, which describes the sample and population from a Bernoulli random variable of mean p, as in Figure 1. As can be seen from Figure 2, for a given sequence of U's we obtain different binary strings depending on the height of the threshold line. Thus it is easy to derive the following implication chain

(K_{p'} ≥ k) ⇐ (p' ≥ P) ⇐ (K_{p'} ≥ k + 1)     (1)

(where k is the number of 1s observed in the sample, and K_{p'} denotes the random variable counting the number of 1s in the sample if the threshold in the explaining function switches to p' for the same realizations of U) and the consequent bound on the probability

P(K_{p'} ≥ k) ≥ P(p' ≥ P) ≥ P(K_{p'} ≥ k + 1)     (2)

which characterizes the cumulative distribution function (c.d.f.) F_P of the parameter P.
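As a minimal illustration of the sampling mechanism and of the bound (2), the following Python sketch simulates M = (U, g_p), checks the implication chain (1) on a grid of candidate thresholds, and evaluates the two binomial tails bounding the c.d.f. of P. It is a sketch only: the variable and function names and the numerical values (m = 20, p = 0.3, the threshold grid) are our own choices for the illustration, not material from the paper.

# Illustrative sketch of the Bernoulli sampling mechanism M = (U, g_p)
# and of the bound (2) on the c.d.f. of the parameter P.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
m = 20                        # observed sample size (illustrative)
u = rng.uniform(size=m)       # realizations of the uniform seeds U_i
p_true = 0.3                  # p_true plays the role of the (unknown) parameter P
k = int(np.sum(u <= p_true))  # number of 1s in the observed sample

def count_ones(p_prime):
    """K_{p'}: number of 1s obtained by switching the threshold to p', same U's."""
    return int(np.sum(u <= p_prime))

# Implication chain (1): raising the threshold never decreases the count of 1s,
# and a count larger than k certifies that the threshold was raised.
for p_prime in np.linspace(0, 1, 11):
    assert (p_prime >= p_true) <= (count_ones(p_prime) >= k)      # right implication
    assert (count_ones(p_prime) >= k + 1) <= (p_prime >= p_true)  # left implication

# Bound (2): P(K_{p'} >= k+1) <= F_P(p') <= P(K_{p'} >= k), with K_{p'} ~ Bin(m, p').
p_prime = 0.4
lower = binom.sf(k, m, p_prime)        # P(K_{p'} >= k+1)
upper = binom.sf(k - 1, m, p_prime)    # P(K_{p'} >= k)
print(f"{lower:.3f} <= F_P({p_prime}) <= {upper:.3f}")

Both tails are ordinary Binomial survival probabilities, which is what makes the distribution of the parameter directly computable from the observed count k.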

In our statistical framework, indeed, the above height P is a random variable in [0, 1] representing the asymptotic (M → ∞ in Figure 2) frequency of 1 in the populations that are compatible, as a function of the U suffix of the sample (of size m in the figure), with the number k of actually observed 1s. Equation (2) comes straight from marginalizing the joint distribution of the U's with respect to the population when we deal with the sample statistic K_{p'}, and vice versa when we deal with the population parameter P. Note the asymmetry in the implications. It derives from the fact that: (i) raising the threshold parameter in g_{p'} cannot decrease the number of 1s in the observed sample, but (ii) we can recognize that such a raising occurred only if we really see a number of ones in the sample greater than k.

Figure 2: Generating a Bernoulli sample. Horizontal axis: index of the U realizations; vertical axis: both the u (lines) and x (bullets) values. The threshold line of height p realizes a mapping from U to X through g_p.

We will refer to every expression similar to (1) as a twisting argument, since it allows us to exchange events on parameters with events on statistics. Twisting sample with population properties is our approach, which we call algorithmic inference, to statistical inference.

The peculiarity of PAC learning boolean functions is that we need more than one sampled point to recognize that the probability measure of the error domain is less than a given ε, in analogy with the right implication in (1). In Section 2 we will denote these points as sentry points of a concept and will derive bounds on the distribution of the above measure. These bounds are based on the supremum of the cardinality of the sentry points over the set of symmetric differences between concepts and candidate hypotheses, which we call detail, and on an analogous limit on the number of points misclassified by the hypotheses. In Section 3 we draw the consequent confidence intervals for the error probability and compare them with the analogous, commonly used intervals proposed by Vapnik and Chervonenkis [6]. A set of graphs shows the great gain in interval width achieved by our approach, especially when the sample size is modest.

2 Algorithmic inference of Boolean functions

In the typical framework of PAC learning theory the parameter to be investigated is the probability that the inferred function will compute erroneously on the next inputs (will not explain new sampled points). In greater detail, the general form of the sample is

z_m = {(x_i, b_i), i = 1, ..., m},

where the b_i are boolean variables. If we assume that for every M and every z_M a c exists in a Boolean concept class, call it C, such that c(x_i) = b_i for i = 1, ..., M, then we are interested in the measure of the symmetric difference between an h (that we denote hypothesis) computed from z_m by a function A (that we call learning algorithm) and any such c (see Figure 3). Note that in our approach, for a given sample z_m, we consider the possible suffixes covered by c.

Figure 3: Circle describing the sample and possible circles describing the population. Small rhombi and circles: sampled points; line-filled region: symmetric difference.

The peculiarity of this inference problem is that some degrees of freedom of our sample are burned by the property whose confidence interval we are looking for. Namely, if we denote by the random variable U_{c⊖h} the measure of the above symmetric difference, the twisting argument reads (with a caveat on the left part):

(T_ε ≥ t_U + 1) ⇐ (ε ≥ U_{c⊖h}) ⇐ (T_ε ≥ t_U + D_{C⊖h})     (3)

where t_U is the number of actual sample points falling in c ⊖ h (the empirical risk in the Vapnik notation [6]), T_ε is the analogous statistic for an enlargement of c ⊖ h of measure ε, and D_{C⊖h} is a new complexity measure directly referred to the class of symmetric differences. The threshold in the left inequality is due to the fact that h is in its own turn a function of a sample specification z_m, so that, if A is such that the symmetric difference grows with the set of included sample points and vice versa (the mentioned caveat), (ε ≥ U_{c⊖h}) implies that any enlargement region containing c ⊖ h must violate consistency on at least one more of the sampled points at the basis of h's computation. The quantity D_{C⊖h} is an upper bound to the number of sample points sufficient to witness that, according to A, a new hypothesis containing c ⊖ h has been generated after an increase of U_{c⊖h}. These points, which we figure as the concepts' sentinels, are formally described as follows.

Definition 2.1. For a concept class C on a space X (the set of specifications of the sampled variable), a sentry function is a total function S: C → 2^X satisfying the conditions:

(i) S(c) ∩ c = ∅ for all c ∈ C (sentinels are outside the sentinelled concept);

(ii) having introduced the sets c⁺ = c ∪ S(c) and up(c) = {c′ ∈ C such that c′ ⊇ c and c′ ⊈ c⁺}, if c_2 ∈ up(c_1) then c_2 ∩ S(c_1) ≠ ∅ (sentinels are inside the invading concept);

(iii) no S′ ≠ S exists satisfying (i) and (ii) and having the property that S′(c) ⊆ S(c) for every c ∈ C (we look for a minimal set of sentinels);

(iv) whenever c_1 and c_2 are such that c_1 ⊆ c_2 ∪ S(c_2) and c_2 ∩ S(c_1) ≠ ∅, the restriction of S to {c_1} ∪ up(c_1) \ {c_2} is a sentry function on this set (sentinels are honest watchers).

S(c) is the frontier of c upon S, and its elements are called sentry points. The quantity D_c = sup_S #S(c) is called the detail of c. For a hypothesis h, denoting C ⊖ h the set {c ⊖ h, c ∈ C}, the detail D_{C⊖h} of this class is the quantity sup_{c∈C} D_{c⊖h}.

A given concept class might admit more than one sentry function. Condition (iv) prevents us

from building sentry functions which are unnatural, where some frontier points of a concept c_1 have the sole role of artificially increasing the elements of c_1 ∪ S(c_1) in order to prevent it from being included in another c_2. The mentioned condition states that this role can be considered only a side effect of points which are primarily involved in sentinelling some formula. Extension of the domain of S in order to include the empty concept is necessary to state some key properties, such as Fact 2.1 below.

Example 2.1. Let us consider a class C_2 = {c_1, ..., c_5} of boolean formulas on {0, 1}^2 in terms of the propositional variables x_1 and x_2. After labelling the points inside and outside each concept, the related supports can be represented in a table over the points 00, 01, 10, 11. By inspection of points (i)-(iv) of Definition 2.1, a possible outer sentry function for C_2 is: S(c_1) = {11}, S(c_2) = {01, 10}, S(c_3) = {01}, S(c_4) = {10}, S(c_5) = ∅. The reader can realize that S(c_1) = {00, 01} is unfeasible according to condition (iv). Indeed, the point 00 has the sole task of removing c_2 and c_3 from up(c_1), but it is useless for sentinelling, in conjunction with 01, the concepts belonging to up(c_1) \ {c_2, c_3}.

The detail is a parameter difficult to compute, except in the case of some Boolean classes usually referred to in the literature: for instance, D_C is 1 if the elements of C are oriented half-lines on ℝ, 2 if they are segments or circles on ℝ², and s if they are convex polygons on ℝ² having exactly s edges. The interested reader can find detailed examples in [1] and [2].
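As a rough computational counterpart of the notion of sentry points, the following sketch enforces only the plainest reading of conditions (i)-(iii), namely that sentinels lie outside a concept and that every concept of the class strictly containing it must include at least one of them; it ignores the up(·) machinery of condition (ii) and condition (iv) altogether. The small class used here, and all names in the code, are our own choices for illustration and are not taken from Example 2.1.

# Naive brute-force sketch (illustration only): for each concept c, find a
# minimum-cardinality set of points outside c that intersects every concept
# of the class strictly containing c. This implements only the plain reading
# of conditions (i)-(iii) of Definition 2.1, not the up(.)/condition (iv) machinery.
from itertools import combinations

X = {"00", "01", "10", "11"}
# Illustrative class over {0,1}^2: empty concept, x1 AND x2, x1, x2, whole space.
C = [frozenset(), frozenset({"11"}), frozenset({"10", "11"}),
     frozenset({"01", "11"}), frozenset(X)]

def naive_sentinels(c, concepts, points):
    invaders = [cp for cp in concepts if cp > c]          # strict supersets of c
    outside = sorted(points - c)
    for size in range(len(outside) + 1):
        for cand in combinations(outside, size):
            if all(set(cand) & cp for cp in invaders):    # every invader is hit
                return set(cand)
    return set()

for c in C:
    print(sorted(c), "->", sorted(naive_sentinels(c, C, X)))
# The largest cardinality found this way gives a first, naive approximation
# of the class detail for this toy class.

On such toy classes the brute-force search is immediate; the computational difficulty mentioned above shows up as soon as the class and the point space grow.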

However, although semantically different from the Vapnik-Chervonenkis dimension, this complexity index is related to the latter by the following:

Fact 2.1. [1] Let us denote by V(C) the Vapnik-Chervonenkis dimension [4] of a concept class C. Then D_C and V(C) bound each other through linear inequalities (see [1] for the exact statement).

Substituting the new complexity index in (3), we find bounds on the sampling complexity for a wide class of learning algorithms according to the following theorem:

Theorem 2.1. [3] For a given probability space (X, F, P), where F is a σ-algebra on X and P is a possibly unknown probability measure defined over F, assume we are given:

1. a concept class C on X, with D_{C⊖h} denoting the detail of the related class of symmetric differences;

2. a sample z_m drawn from the fixed space and labelled according to a c ∈ C labelling an infinite suffix of it;

3. a function A mapping z_m into C and misclassifying at least t_0 and at most t_1 ∈ ℕ points, of total probability not greater than a fixed value in [0, 1).

Let us denote h = A(z_m) and by U_{c⊖h} the random variable representing the probability measure of c ⊖ h for any c ∈ C labelling z_m as in 2. Then for each ε ∈ (0, 1)

I_ε(1 + t_0, m - t_0) ≥ P(U_{c⊖h} ≤ ε) ≥ I_ε(D_{C⊖h} + t_1, m - (D_{C⊖h} + t_1) + 1)     (4)

where I_x(a, b) = ∫_0^x t^(a-1) (1 - t)^(b-1) dt / ∫_0^1 t^(a-1) (1 - t)^(b-1) dt is the incomplete Beta function.
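Since the two bounds in (4) are incomplete Beta functions, i.e. Beta cumulative distribution functions, they can be evaluated directly. The following minimal sketch does so; the numerical values of m, t_0, t_1 and of the detail, as well as the variable names, are arbitrary illustrative choices.

# Minimal sketch: evaluating the two incomplete-Beta bounds of (4) on the
# c.d.f. of the risk of the symmetric difference. Values are illustrative only.
from scipy.stats import beta

m   = 100   # sample size
t0  = 3     # minimum number of misclassified points
t1  = 5     # maximum number of misclassified points
D   = 4     # detail of the class of symmetric differences
eps = 0.15

upper = beta.cdf(eps, 1 + t0, m - t0)            # I_eps(1 + t0, m - t0)
lower = beta.cdf(eps, D + t1, m - (D + t1) + 1)  # I_eps(D + t1, m - (D + t1) + 1)
print(f"{lower:.3f} <= P(U <= {eps}) <= {upper:.3f}")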

3 Confidence intervals for the learning error

Inequalities (4) fix a lower and an upper bound for the cumulative distribution function of U_{c⊖h}, allowing us to state the following result:

Theorem 3.1. For a probability space (X, F, P), a concept class C, a learning algorithm A and parameters m, D_{C⊖h}, t_0 and t_1 as in Theorem 2.1, the event

l ≤ U_{c⊖h} ≤ u     (5)

has probability greater than 1 - δ, where l is the δ/2 quantile of the Beta distribution with parameters 1 + t_0 and m - t_0, and u is the analogous 1 - δ/2 quantile for parameters D_{C⊖h} + t_1 and m - (D_{C⊖h} + t_1) + 1.

Proof. Starting from (4) and getting I_u(D_{C⊖h} + t_1, m - (D_{C⊖h} + t_1) + 1) - I_l(t_0 + 1, m - t_0) as a lower bound to F_U(u) - F_U(l) = P(l ≤ U_{c⊖h} ≤ u), we can obtain the interval (l, u) by dividing the probability measure outside it into two equal parts, that is by solving the equation system

I_u(D_{C⊖h} + t_1, m - (D_{C⊖h} + t_1) + 1) = 1 - δ/2
I_l(t_0 + 1, m - t_0) = δ/2     (6)

with respect to l and u. Hence the claim follows.

In true statistical inference notation, the statement P(l ≤ U_{c⊖h} ≤ u) ≥ 1 - δ means that l and u are the extremes of a confidence interval at level δ for the learning error U_{c⊖h}. When m grows and the numerical solutions of (6) become difficult to handle, the Binomial distribution underlying them can be approximated with a Gaussian law, following the De Moivre-Laplace theorem [7]. The commonly used confidence interval for the same probability comes from the following theorem:

Theorem 3.2. [6] Let C = {c_α, α ∈ Λ} be a Boolean concept class of bounded Vapnik-Chervonenkis dimension V, and let ν(α) be the frequency of errors computed from the sample for a concept c_α ∈ C, R(α) being its actual risk. Then, for a sample of size m and simultaneously for all the concepts in C, the event

ν(α) - 2 √((V (log(2m/V) + 1) - log(δ/4)) / m) ≤ R(α) ≤ ν(α) + 2 √((V (log(2m/V) + 1) - log(δ/4)) / m)     (7)

has probability 1 - δ.
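To put the two theorems side by side numerically, the following sketch solves the system (6) through Beta quantiles and evaluates the interval width prescribed by (7). The sample size, error count, detail, confidence level and all helper names are illustrative choices of ours, with t_0 = t_1 and V(C) set equal to the detail, in the same spirit as the comparison described below.

# Minimal sketch comparing the interval of Theorem 3.1 (Beta quantiles solving
# system (6)) with the Vapnik-Chervonenkis interval (7). Values are illustrative,
# in the spirit of Figures 4 and 5.
import numpy as np
from scipy.stats import beta

def algorithmic_inference_interval(m, t0, t1, D, delta):
    lo = beta.ppf(delta / 2, t0 + 1, m - t0)                # I_l(t0+1, m-t0) = delta/2
    up = beta.ppf(1 - delta / 2, D + t1, m - (D + t1) + 1)  # I_u(D+t1, ...) = 1-delta/2
    return lo, up

def vapnik_interval(m, t, V, delta):
    nu = t / m                                              # empirical risk
    eps = 2 * np.sqrt((V * (np.log(2 * m / V) + 1) - np.log(delta / 4)) / m)
    return nu - eps, nu + eps                               # may fall outside [0, 1]

m, t, D, delta = 1000, 40, 4, 0.10      # t0 = t1 = t, V(C) = D, as in the comparison
print(algorithmic_inference_interval(m, t, t, D, delta))
print(vapnik_interval(m, t, D, delta))

With these illustrative numbers the Beta-quantile interval is markedly narrower than the Vapnik-Chervonenkis one and stays inside [0, 1], in line with the behaviour reported in Figures 4 and 5.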

Note that Theorem 2.1 can be stated in terms of the number of mislabelled points and of the frontier cardinality of the single learning task (essentially maintaining the same proof), as Theorem 3.1 does for the former quantity alone. For simplicity's sake, in the following numerical example we refer to upper bounds (or to values constant with concepts and hypotheses) on both the complexity indices and the empirical error.

Figure 4 compares our confidence intervals with the ones obtained by computing formula (7) for a set of values of the number of mislabelled points and of the complexity indices. Following the previous remark, we compute the former quantity in the case of the Vapnik formula as mν, where ν is the empirical risk (here constant with α). In the same spirit we assume V(C) = D_{C⊖h} and t_0 = t_1, treated as a continuous variable. For samples of 100, 1000 and 1000000 elements respectively, the three graphs show the limits of the bilateral confidence intervals drawn using both the Vapnik (external surfaces) and our (internal surfaces) bounds.

Figure 4: Comparison between bilateral confidence intervals for the actual risk. x-axis: number of misclassified points; y-axis: Vapnik-Chervonenkis dimension and class detail; z-axis: confidence interval limits. Light surfaces: Vapnik-Chervonenkis confidence intervals. Dark surfaces: our confidence intervals. (a) Sample size m = 100, (b) sample size m = 1000, (c) sample size m = 1000000.

Moreover, to appreciate the differences even better, in Figure 5 we draw a section at concept complexity 4 as a function of the number of misclassified points. We used dark gray lines for plotting the bounds from (6) and light gray lines for those from their Gaussian approximation. Note that these different bounds are distinguishable only in the first figure. The figure shows that: (i) our confidence intervals are always more accurate than Vapnik's; this benefit accounts for a narrowing of one order of magnitude at the smallest sample size, while it tends to disappear when the sample size increases; and (ii) our

confidence intervals are consistent, that is they are always contained in [0, 1].

Figure 5: Same comparison as in Figure 4 for class complexity = 4. x-axis: number of misclassified points; y-axis: confidence interval limits. Black lines: Vapnik-Chervonenkis confidence intervals. Dark gray lines: our confidence intervals. Light gray lines: our confidence intervals obtained using the Gaussian approximation.

References

[1] APOLLONI, B., AND CHIARAVALLI, S. PAC learning of concept classes through the boundaries of their items. Theoretical Computer Science 172 (1997), 91-120.

[2] APOLLONI, B., AND MALCHIODI, D. Gaining degrees of freedom in subsymbolic learning. Theoretical Computer Science 255 (2001), 295-321.

[3] APOLLONI, B., MALCHIODI, D., OROVAS, C., AND PALMAS, G. From synapses to rules. Cognitive Systems Research. In press.

[4] BLUMER, A., EHRENFEUCHT, A., HAUSSLER, D., AND WARMUTH, M. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM 36 (1989), 929-965.

[5] ROHATGI, V. K. An Introduction to Probability Theory and Mathematical Statistics. Wiley Series in Probability and Mathematical Statistics. John Wiley & Sons, New York, 1976.

[6] VAPNIK, V. Estimation of Dependences Based on Empirical Data. Springer, New York, 1982.

[7] WILKS, S. S. Mathematical Statistics. Wiley Publications in Statistics. John Wiley, New York, 1965.