Twisting Sample Observations with Population Properties to Learn

B. APOLLONI, S. BASSIS, S. GAITO and D. MALCHIODI
Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano
Via Comelico 39, Milano, ITALY

Abstract: We introduce a theoretical framework for Probably Approximately Correct (PAC) learning. It enables us to compute the distribution law of the random variable representing the probability of the region where the hypothesis is incorrect. The distinguishing feature with respect to the inference of an analogous probability from a Bernoulli variable is the dependence of this distribution on a complexity parameter playing a companion role to the Vapnik-Chervonenkis dimension.

Key-Words: Computational learning, statistical inference, twisting argument.

1 Introduction

A very innovative aspect of PAC learning [7] is to assume that probabilities are random variables per se. This is not the point of confidence intervals in classical statistical theory, where randomness is due to the extremes of the intervals rather than to the value of the probabilistic parameter at hand. For instance, let us consider a Bernoulli variable X and the inequality

$\min_T \{\, T : P(\theta \le T) \ge 1 - \delta \,\}$    (1)

identifying a 1-δ confidence interval for the parameter θ = P(X = 1). Here T is a function f of a random sample (X_1, ..., X_m), the minimum is taken over some free parameters of f, and the randomness of the event in the brackets derives only from the randomness of the sample.

In this statistical framework typical PAC learning inequalities [8] hide some subtle aspects. With respect to a class C of Boolean functions {g_α} and two elements of it representing the target concept c and its approximating hypothesis h, consider the event

$B = \left( \sup_{\alpha \in \Delta} \left( R(\alpha) - \nu(\alpha) \right) < \varepsilon \right)$    (2)

for a given ε, where Δ is a set of reals, α indexes candidate h's within C, R(α) is the probability measure of the related symmetric difference c △ h, and ν(α) is the sample frequency of falling in this domain. If we fix ν(α) = 0, the probabilistic features of this event come from the randomness of the set Δ collecting hypotheses whose symmetric difference with c contains no points of a random sample. However, if in order to bound from above the probability of this event we enlarge Δ to the whole set of indices of C, we refer to an event that has no meaning in classical statistical theory. On the contrary, it does make sense to require

$P(B \mid \nu(\alpha) = 0) = P\!\left( \sup_{\alpha \in \Delta} R(\alpha) < \varepsilon \right) = P(R(\alpha^*) < \varepsilon) > 1 - \delta$    (3)

for some α* ∈ Δ, if we assume the probability measure of the error domain related to α* to be, in turn, a random variable.

Right from the start, the object of our inference is a string of data X (possibly of infinite length) that we partition into a prefix we assume to be known at present (and therefore call sample) and a suffix of unknown future data we call a population (see Fig. 1). All these data share the feature of being independent observations of the same phenomenon. Therefore, without loss of generality, we assume these data to be the output of some function g_θ having input from a set of independent random variables U uniformly distributed in the unit interval.

Fig. 1. Generating a sample of Bernoulli variables. x-axis: index of the U realizations; y-axis: both U (bars) and X (bullets) values. The threshold line p realizes a mapping from U to X through (4).
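Before formalizing the mechanism, it may help to see the construction of Fig. 1 in code. The sketch below is ours, not from the paper (NumPy assumed, with illustrative sizes and threshold); it only mirrors the picture: uniform seeds, a threshold, and the split into a sample prefix and a population suffix.

```python
import numpy as np

# A minimal sketch of the sampling mechanism of Fig. 1: independent uniform seeds
# are turned into Bernoulli observations by a threshold p, and the resulting string
# is split into a known prefix (sample) and an unknown suffix (population).
rng = np.random.default_rng(42)
m, n, p = 20, 200, 0.4        # sample size, population size, threshold (all illustrative)

u = rng.random(m + n)         # the uniform seeds U
x = (u <= p).astype(int)      # the explained Bernoulli string X

sample, population = x[:m], x[m:]
print("sample frequency of ones:    ", sample.mean())
print("population frequency of ones:", population.mean())
```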

By default, capital letters (such as U, X) will denote random variables and small letters (u, x) their corresponding realizations. We will refer to M = (U, g_θ) as a sampling mechanism and to g_θ as an explaining function; this function is precisely the object of our inference. Consider, for instance, the sampling mechanism M = (U, g_p), where

$g_p(u) = \begin{cases} 1 & \text{if } u \le p \\ 0 & \text{otherwise} \end{cases}$    (4)

explains a sample and population distributed according to a Bernoulli law of mean p. As shown in Fig. 1, for a given sequence of U's we obtain different binary strings depending on the height of the threshold line corresponding to p. Thus it is easy to derive the following implication chain

$(k_{\tilde p} \ge k) \Leftarrow (p < \tilde p) \Leftarrow (k_{\tilde p} \ge k+1)$    (5)

and the consequent bound on the probability

$P(K_{\tilde p} \ge k) \ \ge\ P(P < \tilde p) = F_P(\tilde p) \ \ge\ P(K_{\tilde p} \ge k+1)$    (6)

which characterizes the cumulative distribution function (c.d.f.) F_P of the parameter P, representing the asymptotic frequency of 1 in the population compatible with the number k of 1's in the sample. Here k denotes the number of 1's in the sample and K_p̃ denotes the random variable counting the number of 1's in the sample when the threshold in the explaining function switches to p̃ for the same realizations of U. With reference to the probability space (Ω, Σ, P), where Ω is the [0, 1] interval, Σ the related sigma-algebra and P is uniformly distributed on Ω, the P in (6) refers to the product space of the sample and population of U's in Ω; the sample and population of the Bernoulli variable are just a function of them. The probabilities regarding the statistic come from the marginalization of the joint distribution with respect to the population, while the distribution of the parameter comes from the marginalization with respect to the sample.

Note the asymmetry in the implications. It derives from the fact that raising the threshold parameter in g_p cannot decrease the number of 1's in the observed sample, but we can recognize that such a raising occurred only if we really see a number of 1's in the sample greater than k. We will refer to every expression similar to (5) as a twisting argument [4], since it allows us to exchange events on parameters with events on statistics. Its peculiarity lies in the fact that, since the first and last probability in (6) are completely known, we are able to identify the distribution law of the unknown parameter. A more thorough discussion of this tool and of the related algorithmic inference framework can be found in [4].

The principal use we will make of relations like (6) is to compute confidence intervals for the unknown parameter P. In the above inference framework it does not make sense to assume a specific random variable as given; rather, we refer to families of random variables with some free parameters whose distribution law is discovered from a sample through twisting arguments. Nevertheless, both for conciseness and for the sake of linking our results with conventional ones, we will often keep referring, by abuse of notation, to a given random variable X, yet with random parameters. Within this notation we match our notion of sample, as a prefix of a string of data, with the usual definition of it as a specification of a set of identically distributed random variables.
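Before moving to the example, the mechanics of the twisting argument (5) are easy to check numerically: raising the threshold never decreases the count of ones, and a strictly larger count certifies that the threshold was raised. The sketch below is ours (NumPy assumed) and verifies both implications on one fixed realization of the uniform seeds.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20
u = rng.random(m)            # fixed realizations of the uniform seeds U_1..U_m
p = 0.4                      # threshold generating the observed sample, as in (4)
k = int(np.sum(u <= p))      # observed number of 1's

def k_tilde(p_tilde):
    """Number of 1's obtained by switching the threshold to p_tilde (same U's)."""
    return int(np.sum(u <= p_tilde))

for p_tilde in np.linspace(0.0, 1.0, 101):
    kt = k_tilde(p_tilde)
    # Left implication of (5): p < p_tilde  implies  k_tilde >= k
    if p < p_tilde:
        assert kt >= k
    # Right implication of (5): k_tilde >= k + 1  implies  p < p_tilde
    if kt >= k + 1:
        assert p < p_tilde
print("implication chain (5) verified on this realization, k =", k)
```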
Example [Inferring a probability]. Let X denote a random variable distributed according to a Bernoulli law of mean P, (X_1, ..., X_m) a sample of size m from X and k = Σ_i x_i the number of 1's in a specification of the sample. A symmetric confidence interval of level δ for P is (l_i, l_s), where l_i is the δ/2 quantile of the Beta distribution of parameters k and m−k+1 [6], and l_s is the analogous 1−δ/2 quantile for parameters k+1 and m−k. Indeed, consider the explanation of X given by (4) and the twisting argument (5). In this case K_p̃ follows a Binomial distribution law of parameters m and p̃, so that (6) reads

$\sum_{i=k}^{m} \binom{m}{i} \tilde p^{\,i} (1-\tilde p)^{m-i} \ \ge\ F_P(\tilde p) \ \ge\ \sum_{i=k+1}^{m} \binom{m}{i} \tilde p^{\,i} (1-\tilde p)^{m-i}$    (7)

Having introduced the incomplete Beta function I_β as the c.d.f. of a random variable Be(h, r) following a Beta distribution of parameters h and r, that is

$I_\beta(h,r) \equiv P(\mathrm{Be}(h,r) \le \beta) = 1 - \sum_{i=0}^{h-1} \binom{h+r-1}{i} \beta^{\,i} (1-\beta)^{h+r-1-i}$    (8)

the above bounds can be written as

$I_{\tilde p}(k, m-k+1) \ \ge\ F_P(\tilde p) \ \ge\ I_{\tilde p}(k+1, m-k)$    (9)
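As a numerical aside (our own sketch, not part of the paper; SciPy assumed), the interval extremes stated at the beginning of this example are just Beta quantiles, directly available in scipy.stats; the script also checks the binomial-tail identity (8) underlying (7) and (9).

```python
import numpy as np
from scipy.stats import beta, binom

def bernoulli_ci(k, m, delta=0.1):
    """Symmetric level-delta confidence interval for P as stated above:
    l_i = delta/2 quantile of Beta(k, m-k+1), l_s = (1-delta/2) quantile of Beta(k+1, m-k)."""
    l_i = beta.ppf(delta / 2, k, m - k + 1) if k > 0 else 0.0   # guard at the degenerate end k = 0
    l_s = beta.ppf(1 - delta / 2, k + 1, m - k) if k < m else 1.0  # guard at k = m
    return l_i, l_s

m, k, delta = 20, 7, 0.1
print(bernoulli_ci(k, m, delta))

# Sanity check of (8): I_beta(h, r) equals the upper binomial tail P(Bin(h+r-1, beta) >= h).
h, r, b = k, m - k + 1, 0.37
lhs = beta.cdf(b, h, r)
rhs = 1 - binom.cdf(h - 1, h + r - 1, b)   # = sum_{i=h}^{h+r-1} C(h+r-1, i) b^i (1-b)^(h+r-1-i)
assert np.isclose(lhs, rhs)
```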

Therefore, since

$\delta_I = I_{l_s}(k+1, m-k) - I_{l_i}(k, m-k+1)$    (10)

is a lower bound to F_P(l_s) − F_P(l_i) = P(l_i < P < l_s), the desired confidence interval can be found by dividing the probability measure outside (l_i, l_s) into two equal parts, in order to obtain a two-sided interval symmetric in the tail probabilities. We thus obtain the extremes of the interval as the solutions l_i and l_s of the system of equations

$I_{l_s}(k+1, m-k) = 1 - \delta/2$    (11)
$I_{l_i}(k, m-k+1) = \delta/2$    (12)

To check the effectiveness of this computation we considered a string of unitary uniform variables representing, respectively, the randomness source of a sample and a population of Bernoulli variables. Then, according to the explaining function (4), we computed a sequence of 220-bit-long Bernoulli vectors as p rises from 0 to 1. The trajectory described by the point of coordinates k/20 and h/200, i.e. the frequency of ones in the sample and in the population respectively, is reported along one fret line in Fig. 2. We repeated this experiment 20 times (each time using different vectors of uniform variables). Then we drew on the same graph the solutions of equations (11-12) with respect to l_i and l_s, for varying k and δ = 0.1. As we can see, for a given value of k the intercepts of the above curves with a vertical line of abscissa k/20 determine an interval containing almost all intercepts of the frets with the same line. A more intensive experiment would show that, in the approximation of h/200 with the asymptotic frequency of ones in the suffixes of the first 20 sampled values, on all samples (and even for each sample, if we draw many suffixes of the same one) almost 100(1−δ) percent of the frets fall within the analytically computed curves.

Fig. 2. Generating 0.9 confidence intervals for the mean P of a Bernoulli random variable with population and sample of n=200 and m=20 elements, respectively. φ = k/m = frequency of ones in the sample; ψ = h/n = frequency of ones in the population. Fret lines: trajectories described by the number of 1's in sample and population when p ranges from 0 to 1, for different sets of initial uniform random variables. Curves: trajectories described by the interval extremes when the observed number k of 1's in the sample ranges from 0 to m.
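The fret-line experiment of Fig. 2 is straightforward to reproduce. The following sketch is our reconstruction (NumPy and SciPy as assumed dependencies): for each set of 220 uniform seeds it sweeps p over a grid, records the sample and population frequencies, and reports how often the population frequency falls between the interval extremes solved from (11-12).

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
m, n, delta = 20, 200, 0.1

def interval(k):
    # Solutions of (11)-(12): Beta quantiles, with guards at the degenerate ends.
    l_i = beta.ppf(delta / 2, k, m - k + 1) if k > 0 else 0.0
    l_s = beta.ppf(1 - delta / 2, k + 1, m - k) if k < m else 1.0
    return l_i, l_s

inside = total = 0
for _ in range(20):                       # 20 fret lines
    u = rng.random(m + n)                 # sample seeds followed by population seeds
    for p in np.linspace(0.0, 1.0, 221):  # threshold rising from 0 to 1
        k = int(np.sum(u[:m] <= p))       # ones in the sample
        h = int(np.sum(u[m:] <= p))       # ones in the population
        l_i, l_s = interval(k)
        inside += (l_i <= h / n <= l_s)
        total += 1
print("fraction of fret points inside the 0.9 curves:", inside / total)
```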
2 Algorithmic inference of Boolean functions

In the PAC learning framework the parameter under investigation is the probability that the inferred function will compute erroneously on the next inputs (i.e. will not explain new sampled points). In greater detail, the general form of the sample is

$Z_m = \{(X_i, b_i),\ i = 1, \dots, m\}$    (13)

where the b_i are Boolean variables. If we assume that for every M and every Z_M an f exists in a Boolean class C, call it c, such that Z_M = {(X_i, c(X_i)), i = 1, ..., M}, then we are interested in the measure of the symmetric difference c △ h between any such c and another function computed from Z_m, which we denote as the hypothesis h (see Fig. 3).

Fig. 3. Functionally linked variables: circle h describing the sample and possible circles describing the population. Small diamonds and circles: sampled points; line-filled region: symmetric difference.

The peculiarity of this inference problem is that some degrees of freedom of our sample are burned by the functional links on the labels. Namely, let us denote by U_{c△h} the measure of the above symmetric difference: for a given z_m this is the random variable corresponding to the parameter R(α) for a suitable mapping from {h} to Δ. Then the twisting argument reads (with some caveats):

$(T_\varepsilon \ge t_{U_{c\triangle h}} + 1) \Leftarrow (U_{c\triangle h} < \varepsilon) \Leftarrow (T_\varepsilon \ge t_{U_{c\triangle h}} + \mu)$    (14)

where t_{U_{c△h}} is the number of actual sample points falling in c △ h (the empirical risk in Vapnik's notation [8]), T_ε is the analogous statistic for an enlargement of c △ h of measure ε, and µ is a new complexity measure directly referred to C. The threshold in the left inequality is due to the fact that h is in its own turn a function A of a sample specification z_m, so that if A is such that the symmetric difference grows with the set of included sample points, and vice versa, then (U_{c△h} < ε) implies that any enlargement region containing c △ h must violate the label of at least one more of the sampled points at the basis of the computation of h. The quantity µ is an upper bound on the number of sample points sufficient to witness an eventual increase of U_{c△h} after a new hypothesis containing c △ h has been generated. Its extension to the whole class of concepts C and class of hypotheses H is called the detail D_{C,H}. Although semantically different from the VC dimension [5], when H coincides with C this complexity index is related to the latter by the following theorem.

Theorem [1]: Denoting by d_VC(C) the VC dimension of a concept class C with detail D_{C,C},

$(d_{\mathrm{VC}}(C) - 1)/176 < D_{C,C} < d_{\mathrm{VC}}(C) + 1$    (15)

Theorem [2]: Assume we are given a concept class C on a space, a sample z_m drawn from Z_m as in (13), and a learning function A: {z_m} → C (satisfying the usual regularity conditions, which represent the counterpart of the "well behaved function" request of [5]; for a formal definition see [1]). Consider the family of sets {c △ h} with c ∈ C labeling z_m, h = A(z_m) and detail D_{C,C} = µ, misclassifying at least t′ and at most t points, of probability π ∈ (0, 1), and denote by U_{c△h} the random variable given by the probability measure of c △ h and by F_{U_{c△h}} its c.d.f. Then for each z_m and β ∈ (π, 1)

$I_\beta(1+t', m-t') \ \ge\ F_{U_{c\triangle h}}(\beta) \ \ge\ I_\beta(\mu+t, m-(\mu+t)+1)$    (16)

where

$I_\beta(\mu+t, m-(\mu+t)+1) = 1 - \sum_{i=0}^{\mu+t-1} \binom{m}{i} \beta^{\,i} (1-\beta)^{m-i}$    (17)

is the incomplete Beta function.

Corollary: Under the same hypotheses of the above theorem:
1. the ratio between the maximum and minimum numbers of examples needed to learn C with accuracy parameters 0 < ε < 1/8, 0 < δ < 1/100 is bounded by a constant [1];
2. a pair of extremes (l_i, l_s) of a confidence interval of level δ for U_{c△h} is constituted, respectively, by the δ/2 quantile of the Beta distribution of parameters 1+t and m−t, and by the analogous 1−δ/2 quantile for parameters µ+t and m−(µ+t)+1 [4] (see the sketch after this list);
3. class complexity µ and hypothesis accuracy t add linearly [3].
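Item 2 of the corollary is directly computable. The sketch below is ours (SciPy assumed; the inputs m, t, µ, δ are treated as given) and simply returns the two Beta quantiles named there.

```python
from scipy.stats import beta

def error_risk_ci(m, t, mu, delta=0.1):
    """Confidence interval for the error measure U_{c delta h} as in item 2 of the corollary:
    lower extreme = delta/2 quantile of Beta(1 + t, m - t),
    upper extreme = (1 - delta/2) quantile of Beta(mu + t, m - (mu + t) + 1)."""
    l_i = beta.ppf(delta / 2, 1 + t, m - t)
    l_s = beta.ppf(1 - delta / 2, mu + t, m - (mu + t) + 1)
    return l_i, l_s

# Example: 30 points, a consistent hypothesis (t = 0), detail mu = 4 vs. mu = 1.
print(error_risk_ci(30, 0, 4))
print(error_risk_ci(30, 0, 1))
```

Raising µ moves only the upper extreme, so the class detail widens the interval from above while the lower extreme depends on the misclassification count alone.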
Example [Learning rectangles]: Consider h belonging to the class of rectangles. We move from the one-dimensional case of Fig. 2 to the two-dimensional case depicted in Fig. 4. Here again we give label 1 to a single coordinate u_j, j = 1, 2 (each ruled by a corresponding uniform random variable U_j) if it falls below a given threshold p_j, and label 0 otherwise. Moreover, we give to the point a_i of coordinates (u_1, u_2) a label equal to the product of the labels of its single coordinates. Thus the probability p_c that a point a falls in the open rectangle c bounded by the coordinate axes and the two mentioned threshold lines (for short we will henceforth refer to these rectangles as bounds' rectangles) is p_1 p_2. Let us complicate the inference scheme in two ways:
1. We move from U_j to the family of uniform random variables Z_j in [0, θ], explained by the function z = θu with θ ∈ (0, +∞).
2. We maintain the same labeling rule but do not know c, i.e. the thresholds p_1 and p_2.

Rather, within the class of bounds' rectangles containing all 1-labeled sample points yet excluding all 0-labeled sample points (the consistent bounds' rectangles, as statistics), we will identify it with the maximal one h, i.e. the one having the largest consistent edges (ending just before the closest 0-labeled points). Letting p′_1 and p′_2 be the lengths of these edges, we now look for the probability p_h = p′_1 p′_2 / θ², representing the asymptotic frequency with which future points (generated with the above sampling mechanism for any θ) will fall in h.

We may imagine a whole family of sequences of domains B, each sequence pivoted on a possible rectangle. In some sequences the domain B_p̃ of measure p̃ will include the pivot, in other ones it will be included by it. Thus we need witnesses that our actual h, computed from the actual sample, constitutes a pivot included in B_p̃. But this happens if two special points, namely the negative point preventing the rectangle from expanding sideways and the negative point preventing it from expanding upward, are included in B_p̃. Thus let us enrich the family of sequences, taking for each rectangle and each possible pair of witness points the pivot constituted by the union of the rectangle with these points. With respect to the sequence pivoted on our actual rectangle and witnesses, we have that if 2 or more negative points are included in B_p̃, then for sure the witness points are among them. Hence a twisting argument reads:

$(p_h < \tilde p) \Leftarrow (k_{\tilde p} \ge k + 2)$    (18)

where k_p̃ is still a specification of a Binomial random variable of parameters m (the sample size) and p̃, accounting for the sample points contained in B_p̃. From the left, since θ is free, (p_h < p̃) requires that there be an enlargement of h whose measure, for a proper θ, exactly equals p̃. But since both edges of h are bounded by a negative point, this enlargement must contain at least one point more than h itself. Formally

$(k_{\tilde p} \ge k + 1) \Leftarrow (p_h < \tilde p)$    (19)

Putting together the two pieces of the twisting argument, we obtain the corresponding bounds on probabilities as follows:

$P(K_{\tilde p} \ge k+1) \ \ge\ P(P_h < \tilde p) = F_{P_h}(\tilde p) \ \ge\ P(K_{\tilde p} \ge k+2)$    (20)

Generalizing our arguments to an n-dimensional space, we recognize that the number of points witnessing the expansions of the maximal hypotheses on bounds' rectangles is at most n. Hence n constitutes the detail µ of this class of hypotheses.

We extended the experiment shown in Fig. 2 as follows. We built a sample of m = 30 elements by drawing random Y_i coordinates. To stress the spread of the confidence intervals we abandoned the uniform distribution; namely, we used the sampling mechanism (U, g(u)) with g(u) = u^j for computing the j-th coordinate (so that the first coordinate is u, the second u², etc.). Moreover, to mark the drift from the Bernoulli variable, we refer to a four-dimensional rectangle in Ψ = [0,1]^4, with respect to which the rectangle in Fig. 4(a) could represent just a projection. Then, to figure out a wide set of possible labeling mechanisms, we stored the sample point labelings according to all rectangles with vertices in a suitable discretization (steps of 1/10 on each edge) of the unit hypercube. Then we forgot the source figures and for each labeling computed the maximal consistent bounds' rectangle h, and we drew the graph in Fig. 4(b). Namely, we drew 10 samples as before, and for each sample labeling we reported on the graph the actual frequency φ and the probability p_h (analytically computed on the basis of the rectangle coordinates) of drawing a point in Ψ belonging to the guessed maximal hypothesis. On the same graph we also reported the curves describing the course of the symmetric 0.9 confidence interval for P_h with the observed frequency of falling inside the rectangle h, according to (16). Finally, for comparison's sake we also drew the curves obtained for a pure Bernoulli variable.

Fig. 4. Generating 0.9 confidence intervals for the probability P_h of a bounds' rectangle in Ψ = [0,1]^4 from a sample of 30 elements. (a) The drawn sample and one of its possible labelings in a two-dimensional projection. Bullets: 1-labeled (positive) sampled points; diamonds: 0-labeled (negative) sampled points. (b) Points: values of the frequency φ and probability p_h of falling inside a bounds' rectangle for a close lattice of labeling functions. Dashed curves: trajectories described by the confidence interval extremes with reference to µ = 1. Plain curves: trajectories described by the confidence interval extremes with reference to µ = 4.
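Curves like the µ = 1 and µ = 4 ones of Fig. 4(b) can be sketched by instantiating the extremes of item 2 of the corollary for every possible count of points observed inside the rectangle. The snippet below is our reading of that construction, not the authors' original plotting code (SciPy assumed; here t is taken, as a hedged reading of the figure, as the number of sample points falling inside h).

```python
from scipy.stats import beta

m, delta = 30, 0.1

def ci_extremes(t, mu):
    """Interval extremes from item 2 of the corollary for an observed count t
    and class detail mu (our hedged reading of how the Fig. 4(b) curves arise)."""
    l_i = beta.ppf(delta / 2, 1 + t, m - t)
    l_s = beta.ppf(1 - delta / 2, mu + t, m - (mu + t) + 1)
    return l_i, l_s

# Sweep the observed count t (frequency phi = t/m) for mu = 1, the Bernoulli-like case,
# and mu = 4, the detail of four-dimensional bounds' rectangles; t stays in the valid range.
for mu in (1, 4):
    print(f"mu = {mu}")
    for t in range(0, m - mu + 1, 5):
        l_i, l_s = ci_extremes(t, mu)
        print(f"  phi = {t / m:.2f}  interval = ({l_i:.3f}, {l_s:.3f})")
```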

In spite of some apparent unbalancing in the figure, the percentages of points falling outside the upper and lower bound curves are approximately equal, 3.32% and 3.67% respectively, thus satisfying the 5% upper bounds used for drawing these curves. The analogous percentages for the curves drawn for the Bernoulli distribution, 16.2% and 0.28%, denote their inadequacy. The smaller values of actual versus allowed bound trespassers can be attributed to the worst-case duty of our curves: they must guarantee a given confidence whatever the underlying distribution law is.

The above check on domain measures is the key action of any PAC learning task, where confidence intervals like the ones in Fig. 4(b) are the ultimate probabilistic learning target. Indeed, the distinguishing features of the above case study are the following:
- We are building a domain h on the basis of the sampled coordinates (besides the a priori specifications that the rectangle's lower-left vertex coincides with the axes' origin and that the edge orientations are parallel to them).
- Though coming from independent U_i's, some sample points, those labeled by 1, share the fact of being all inside h, and those labeled by 0 vice versa.
- The above twisting argument holds whatever the joint distribution law of the coordinates is.

The experiment in the figure confirms that the sole probabilistic consequence of these additional features, in comparison to the original case study of Fig. 2, is in the bounds of the confidence region, now pushed up by the fact that four points in place of one need to witness the inclusion of h in a proper domain of measure p̃, and at least one more point is additionally included in an enlargement of h of this measure.

3 Conclusions

PAC learning represents a very innovative perspective in inferential statistics, which relates the randomness of the sample data to the mutual structure deriving from their syntactical properties. We set up an inferential mechanism and a structural complexity index to stress this idea. Assuming a source of uniformly random data, we want to discover the function g_θ mapping these data into random variables of a given and possibly unknown distribution law. In the case of Bernoulli variables whose values are related to ancillary data, a prominent part of our inference may lie in fixing this relation, which is exactly an instance of the problem of learning Boolean functions. In the paper we show the benefit of this contrivance in terms of its capability of studying the distribution law of the error risk, in connection with specific features of the computed hypothesis such as the detail of its class. From an operational viewpoint this benefit translates into a more favourable relation between the sample size and the accuracy parameters ε and δ. In particular, we obtain a definite narrowing of the confidence intervals of the error risk with respect to those usually computed in the Vapnik approach. In a more philosophical perspective, our approach provides a clear statistical rationale for learning theory, laying the premises for new developments of this theory.

References:
[1] B. Apolloni and S. Chiaravalli, PAC learning of concept classes through the boundaries of their items, Theoretical Computer Science 172, 1997.
[2] B. Apolloni, E. Esposito, D. Malchiodi, C. Orovas, G. Palmas and J. G. Taylor, A General Framework for Learning Rules from Data, IEEE Transactions on Neural Networks, 2004, to appear.
[3] B. Apolloni and D. Malchiodi, Gaining degrees of freedom in subsymbolic learning, Theoretical Computer Science 255, 2001.
[4] B. Apolloni, D. Malchiodi and S. Gaito, Algorithmic Inference in Machine Learning, International Series on Advanced Intelligence Vol. 5, Advanced Knowledge International, Magill, Adelaide.
[5] A. Blumer, A. Ehrenfeucht, D. Haussler and M. Warmuth, Learnability and the Vapnik-Chervonenkis Dimension, Journal of the ACM 36, 1989.
[6] J. W. Tukey, Non-parametric estimation II. Statistically equivalent blocks and tolerance regions: the continuous case, Annals of Mathematical Statistics 18, 1947.
[7] L. G. Valiant, A theory of the learnable, Communications of the ACM 27(11), 1984.
[8] V. Vapnik, Statistical Learning Theory, John Wiley, New York, 1998.
