New Bounds for Learning Intervals with Implications for Semi-Supervised Learning


JMLR: Workshop and Conference Proceedings vol (2012) 1–15

David P. Helmbold (dph@soe.ucsc.edu), Department of Computer Science, University of California at Santa Cruz
Philip M. Long (plong@google.com), Google

Abstract

We study learning of initial intervals in the prediction model. We show that for each distribution D over the domain, there is an algorithm A_D whose probability of a mistake in round m is at most (1/2 + o(1)) 1/m. We also show that the best possible bound that can be achieved in the case in which the same algorithm A must be applied for all distributions D is at least (1/√e − o(1)) 1/m > (3/5 − o(1)) 1/m. Informally, knowing the distribution D enables an algorithm to reduce its error rate by a constant factor strictly greater than 1. As advocated by Ben-David et al. (2008), knowledge of D can be viewed as an idealized proxy for a large number of unlabeled examples.

Keywords: Prediction model, initial intervals, semi-supervised learning, error bounds.

1. Introduction

Where to place a decision boundary between a cloud of negative examples and a cloud of positive examples is a core and fundamental issue in machine learning. Learning theory provides some guidance on this question, but gaps in our knowledge persist even in the most basic and idealized formalizations of this problem.

Arguably the most basic such formalization is the learning of initial intervals in the prediction model (Haussler et al., 1994). Each concept in the class is described by a threshold θ, and an instance x ∈ R is labeled + if x ≤ θ and − otherwise. The learning algorithm A is given a labeled m-sample {(x_1, y_1), ..., (x_m, y_m)} where y_i is the label of x_i. The algorithm must then predict a label ŷ for a test point x. The points x_1, ..., x_m and x are drawn independently at random from an arbitrary, unknown probability distribution D.

Let opt(m) be the optimal error probability guarantee using m examples in this model. The best previously known bounds (Haussler et al., 1994) were

    (1/2 − o(1)) 1/m  ≤  opt(m)  ≤  (1 + o(1)) 1/m.    (1)

To our knowledge, this factor-of-2 gap has persisted for nearly two decades.

The proof of the lower bound of (1) uses a specific choice of D, no matter what the target. Thus, it also lower bounds the best possible error probability guarantee for algorithms that are given the distribution D as well as the sample. The upper bound holds for a particular algorithm and all distributions D, so it is also an upper bound on the best possible error probability guarantee when the algorithm does not know D.

© 2012 D.P. Helmbold & P.M. Long.
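To make the protocol concrete, here is a minimal Monte Carlo sketch of the prediction model for initial intervals. It is illustrative only and rests on assumptions not made in the paper: D is taken to be uniform on (0, 1), and the learner shown is a simple midpoint rule used purely as a placeholder (the algorithms actually analyzed appear in Sections 2 through 5); the function names are ours.

# Minimal sketch of the prediction model for initial intervals.
# Assumptions (not from the paper): D is uniform on (0,1); the learner is a
# simple midpoint rule used only as a placeholder.
import random

def midpoint_predictor(sample, x):
    """Predict the label of x from a labeled m-sample of an initial interval."""
    pos = [p for p, y in sample if y == +1]
    neg = [p for p, y in sample if y == -1]
    lo = max(pos) if pos else 0.0      # greatest positive example (or 0)
    hi = min(neg) if neg else 1.0      # least negative example (or 1)
    return +1 if x <= (lo + hi) / 2 else -1

def mistake_probability(m, theta, trials=100000):
    """Estimate the probability of a mistake in 'round m' of the prediction model."""
    errors = 0
    for _ in range(trials):
        xs = [random.random() for _ in range(m)]
        sample = [(p, +1 if p <= theta else -1) for p in xs]
        x = random.random()                     # fresh test point from the same D
        truth = +1 if x <= theta else -1
        errors += (midpoint_predictor(sample, x) != truth)
    return errors / trials

m = 50
print("estimated mistake probability:", mistake_probability(m, theta=0.37), " 1/m =", 1 / m)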

The increasing availability of unlabeled data has inspired much recent research on the question of how to use such data, and the limits on its usefulness. Ben-David et al. (2008) proposed knowledge of the distribution D as a clean, if idealized, proxy for access to large numbers of unlabeled examples. Since the lower bound in (1) did not exploit the algorithm's lack of a priori knowledge of D, improved lower bounds exploiting this lack of knowledge may also shed light on the utility of unlabeled data.

This paper studies how knowledge of D affects the error rate when learning initial intervals in the prediction model. Our positive result is an algorithm that, when given D, has an error rate at most (1/2 + o(1)) 1/m. This matches the lower bound of (1) up to lower order terms. As a complementary negative result, we show that any prediction algorithm without prior knowledge of D can be forced to have an error probability at least (1/√e − o(1)) 1/m > (3/5 − o(1)) 1/m. Thus not knowing D leads to at least a 20% increase in the probability of making a mistake (when the target and distribution are chosen adversarially). A third result shows that the maximum margin algorithm can be forced to have the even higher error rate (1 − o(1)) 1/m.

The training data reduces the version space (the region of potential values of θ) to an interval between the greatest positive example and the least negative example. Furthermore, all examples outside this region are classified correctly by any θ in the version space. Our algorithm achieving the (1/2 + o(1)) 1/m error probability protects against the worst case by choosing a hypothesis θ̂ in the middle (with respect to the distribution D) of this region of uncertainty. This can be viewed as a D-weighted halving algorithm and follows the general principle of getting into the middle of the version space (Herbrich et al., 2001; Kääriäinen, 2005).

As has become common since (Ehrenfeucht et al., 1989), our (1/√e − o(1)) 1/m lower bound proceeds by choosing θ and D randomly and analyzing the error rate of the resulting Bayes optimal algorithm. The distribution D in our construction concentrates a moderate amount q of probability very close to one side or the other of the decision boundary. If D is unknown, and no examples are seen from the accumulation point (likely if q is not too large), the algorithm cannot reliably get into the middle of the version space. Unlike most analyses showing the benefits of semi-supervised learning, which rely on an assumption of sparsity near the decision boundary, our analysis uses distributions that are peaked at the decision boundary.

Related work. Learning from a labeled sample and additional unlabeled examples is called semi-supervised learning. Semi-supervised learning is an active and diverse research area; see standard texts like (Chapelle et al., 2006) and (Zhu and Goldberg, 2009) for more information.

Our work builds most directly on the work of Ben-David et al. (2008). They proposed using knowledge of D as a proxy for access to a very large number of unlabeled examples, and considered how this affects the sample complexity of learning in the PAC model, which is closely related to the error rate in the prediction model (Haussler et al., 1994). Their main results concerned limitations on the impact of the knowledge of D; Darnstädt and Simon (2011) extended this line of research, also demonstrating such limitations. In contrast, the thrust of our main result is the opposite: knowledge of D gives at least a 16% reduction in the (worst-case) prediction error rate. Balcan and Blum (2010) introduced a framework to analyze cases in which partial knowledge of the relationship between the distribution and the target classifier can improve the error rate, whereas the focus of this paper is to study the benefits of knowing D even potentially in the absence of a relationship between D and the target.

Kääriäinen (2005) analyzed algorithms that use unlabeled data to estimate the metric ρ_D(f, g) := Pr_{x∼D}(f(x) ≠ g(x)), and then choose a hypothesis at the center of the version space of classifiers agreeing with the labels on the training examples. Our algorithm for learning given knowledge of D follows this philosophy. Urner et al. (2011) proposed a framework to analyze cases in which unlabeled data can help to train a classifier that can be evaluated more efficiently.

The upper bound of (1) is a consequence of a more general bound in terms of the VC-dimension. Li et al. (2001) showed that the leading constant in the general bound cannot be improved, even in the case that the VC-dimension is 1. However, their construction uses tree-structured classes that are more complicated than the initial intervals studied here.

2. Further Preliminaries and Main Results

For any particular D and θ, the expected error of Algorithm A, Err_m(A; D, θ), is the probability that its prediction is not the correct label of x, where the probability is over the m + 1 random draws from D and any randomization performed by A. We are interested in the worst-case error of the best algorithm: if the algorithm can depend on D, this is

    opt_D(m) = inf_A sup_θ Err_m(A; D, θ),

and, if not,

    opt(m) = inf_A sup_{D, θ} Err_m(A; D, θ).

Our main lower bound is the following.

Theorem 1  opt(m) ≥ (1/√e − o(1)) 1/m > (3/5 − o(1)) 1/m.

This means that for every algorithm learning initial intervals, there is a distribution D and threshold θ such that the algorithm's mistake probability (after seeing m examples, but not the distribution D) is at least (1/√e − o(1)) 1/m. In our proof, the distribution D depends on m.

Our main upper bound is the following.

Theorem 2  For all probability distributions D, opt_D(m) ≤ (1/2 + o(1)) 1/m.

We show Theorem 2 by analyzing an algorithm that gets into the middle of the version space with respect to the given distribution D.

When there are both positive and negative examples, a maximum margin algorithm (Vapnik and Lerner, 1963; Boser et al., 1992) makes its prediction using a hypothesized threshold θ̂ that is halfway between the greatest positive example and the least negative example.

Theorem 3  For any maximum margin algorithm A_MM, there is a D and a θ such that Err_m(A_MM; D, θ) ≥ (1 − o(1)) 1/m.

Our construction chooses D and θ as functions of m. Note that in the case of intervals, the maximum margin algorithm is similar to the (un-weighted) halving algorithm. Throughout we use U to denote the uniform distribution on the open interval (0, 1).

3. Proof of Theorem 1

As mentioned in the introduction, we will choose the target θ and the distribution D randomly, and prove a lower bound on the Bayes optimal algorithm when D and θ are chosen in this way. No algorithm A can do better on average over the random choice of θ and D than the Bayes optimal algorithm that knows the distributions over D and θ. This in turn implies that for any algorithm A there exist a particular θ and D for which the lower bound holds for A. [D: rephrased the second sentence above to clarify Bayes Optimal]

Let q = c/m for a c ∈ [0, m) to be chosen later. Define the distribution D_{θ,q}(x) as the following mixture:

- with probability p = 1 − q, x is drawn from U, the uniform distribution on (0, 1);
- with probability q = c/m, x = θ.

We will analyze the following:

1. Fix the sample size m.
2. Draw θ from U and set the target to be (−∞, θ] with probability 1/2, and (−∞, θ) with probability 1/2.
3. Draw an m-sample S = {x_1, ..., x_m} from D_{θ,q}^m. Extend S to the extended sample S_+ by adding x_0 = 0 and x_{m+1} = 1 to S. The labeled sample is L = {(x_i, y_i)} where x_i ∈ S_+ and y_i is the label of x_i given by the target (either (−∞, θ) or (−∞, θ]).
4. Draw a final test point x, also i.i.d. from D_{θ,q}. [P: can be compressed]

This setting, which we call the open-closed experiment, is not legal, because it sometimes uses open intervals as targets. However, we will now show that a lower bound for this setting implies a similar lower bound when only closed initial intervals are used. We define the legal experiment as above, except that instead of using the open target (−∞, θ), the adversary uses the target (−∞, θ − 1/3^m].

Lemma 4  For any algorithm A and any number m of examples, let p_legal be the probability that A makes a mistake in the legal experiment, and p_oc be the probability that A makes a mistake in the open-closed experiment. Then p_legal ≥ p_oc − (m+1)/3^m.

Proof. If none of the training or test examples falls (strictly) between θ − 1/3^m and θ, then the training and test data are the same in both experiments. The probability that this happens is at least 1 − (m+1)/3^m.

Since a (C − o(1))/m lower bound for the open-closed experiment implies such a bound for the legal experiment, we can concentrate on the open-closed experiment. We now consider the following events:

- mistake is the event that the Bayes Optimal classifier predicts incorrectly;
- zero is the event that no x_i ∈ S equals θ;
- one is the event that exactly one x_i ∈ S equals θ.
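For illustration, the adversarial mixture D_{θ,q} and one run of the experiment can be sketched as follows. This is our own sketch, with two simplifications not made above: only the closed target (−∞, θ] is used (the experiment randomizes between the open and closed targets), and c = 1/2 is fixed, matching the choice made at the end of this section.

# Sketch of the adversarial distribution D_{theta,q} and one (closed-target)
# run of the experiment of Section 3.  Simplifications noted in the text above.
import random

def draw_from_D(theta, q):
    """With probability q return theta itself, otherwise a uniform point in (0,1)."""
    return theta if random.random() < q else random.random()

def open_closed_trial(m, c=0.5):
    """One run of a closed-target version of the experiment."""
    q = c / m
    theta = random.random()                          # target threshold drawn from U
    S = [draw_from_D(theta, q) for _ in range(m)]
    labels = [+1 if x <= theta else -1 for x in S]   # closed target (-inf, theta]
    x_test = draw_from_D(theta, q)
    return S, labels, x_test, theta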

Now, using these events,

    Pr(mistake) ≥ Pr(zero) Pr(mistake | zero) + Pr(one) Pr(mistake | one).    (2)

3.1. Event zero: no x_i ∈ S is equal to θ

First we lower bound the probability of zero.

Lemma 5  Pr(zero) = (1 − c/m)^m ≥ e^{−c} (1 − c²/m).

Proof: The m draws from D_{θ,q} are independent, so for q = c/m we have Pr(zero) = p^m = (1 − c/m)^m. Furthermore, ln((1 − c/m)^m) = m ln(1 − c/m) ≥ m(−c/m − c²/m²) = −c − c²/m, and exp(−c²/m) ≥ 1 − c²/m.

Now we lower bound the probability of a mistake, given zero. In this case, the x_i's in S and θ are all drawn i.i.d. from the uniform distribution on (0, 1), and with probability one the x_i are all different from θ and each other. Since all permutations are equally likely, with probability 1 − 2/(m+1) there is an x_i ∈ S smaller than θ and an x_j ∈ S greater than θ. Let x_+ be the largest x_i ∈ S_+ smaller than θ, and let x_− be the smallest x_i ∈ S_+ greater than θ. (Recall that instances 0 and 1 have been added to S_+, so x_+ and x_− are well-defined.)

Lemma 6  Assume event zero and let x_+ be the largest positive point in S_+ and x_− the smallest negative point in S_+. After conditioning on the labeled sample L, if the test point is sampled from D_{θ,q} (independently of S_+) then the error probability of the Bayes Optimal predictor is

    Pr(mistake | x_+, x_−) = q/2 + p (x_− − x_+)/4 = c/(2m) + (m − c)(x_− − x_+)/(4m).

Proof: After conditioning on L and event zero, θ is uniformly distributed on (x_+, x_−). The label of the test point x is known whenever x ≤ x_+ or x ≥ x_−. Only when x_+ < x < x_− can the Bayes optimal predictor make a mistake. The Bayes optimal predictor predicts + on x if x < (x_+ + x_−)/2, predicts − on x if x > (x_+ + x_−)/2, and predicts arbitrarily when x = (x_+ + x_−)/2. Therefore, for a given value of θ, when the test point x is drawn from D_{θ,q}, the probability of a mistake is q/2 (for the fraction of the time that x = θ) plus p |θ − (x_+ + x_−)/2| (for the chance that x is drawn from U and falls between θ and the midpoint of (x_+, x_−)). Thus

    Pr(mistake | x_+, x_−) = ∫_{x_+}^{x_−} ( q/2 + p |θ − (x_+ + x_−)/2| ) dP(θ | θ ∈ [x_+, x_−])
                           = q/2 + p (x_− − x_+)/4,

as desired.

The following lemma is due to Moran (1947).

Lemma 7 (Moran, 1947)  Assume a set {x_1, ..., x_m} of m ≥ 3 points is drawn i.i.d. from the uniform distribution on the unit interval. Relabel these points so that x_1 ≤ x_2 ≤ ... ≤ x_m and set x_0 = 0 and x_{m+1} = 1. For the m + 1 gap lengths defined by g_i = x_{i+1} − x_i for 0 ≤ i ≤ m, we have E( Σ_{i=0}^{m} g_i² ) = 2/(m + 2).

Note: if the set of points is drawn from a circle, then the first point can be taken as the endpoint, splitting the circle into an interval. This leads to the following, which we will find useful later.

Lemma 8  The expected sum of squared arc-lengths when the circle of circumference 1 is partitioned by m random points from the uniform distribution is 2/(m + 1).

Coming back to our lower bound proof, we are now ready to prove a lower bound for the zero case.

Lemma 9  For the Bayes Optimal Predictor,

    Pr(mistake | zero) = c/(2m) + 1/(2(m+2)) − c/(2m(m+2)).

Proof: Given the event zero, each x_i in S is drawn i.i.d. from the uniform distribution on (0, 1). We find it convenient to relabel the x_i ∈ S in sorted order so that x_1 ≤ x_2 ≤ ... ≤ x_m. Note that x_0 = 0 and x_{m+1} = 1 in S_+ are defined consistently with this sorted order. To simplify the notation, we leave the conditioning on event zero implicit in the remainder of the proof.

    Pr(mistake) = ∫∫ Pr(mistake | S, θ) dθ dP(S)    (3)
                = ∫ Σ_{i=0}^{m} Pr(θ ∈ (x_i, x_{i+1}) | S) Pr(mistake | θ ∈ (x_i, x_{i+1}), S) dP(S)    (4)

Since the threshold θ is drawn from U on (0, 1), Pr(θ ∈ (x_i, x_{i+1}) | S) = x_{i+1} − x_i. With an application of Lemma 6 we get:

    Σ_{i=0}^{m} Pr(θ ∈ (x_i, x_{i+1}) | S) Pr(mistake | θ ∈ (x_i, x_{i+1}), S)
        = Σ_{i=0}^{m} (x_{i+1} − x_i) ( q/2 + p (x_{i+1} − x_i)/4 )
        = q/2 + (p/4) Σ_{i=0}^{m} (x_{i+1} − x_i)².

Substituting this into (4),

    Pr(mistake) = ∫ ( q/2 + (p/4) Σ_{i=0}^{m} (x_{i+1} − x_i)² ) dP(S)    (5)
                = q/2 + (p/4) E_{S∼U^m} [ Σ_{i=0}^{m} (x_{i+1} − x_i)² ]    (6)
                = q/2 + p/(2(m+2)),    (7)

using Lemma 7 to evaluate the expectation in (6). Replacing q by c/m and p by 1 − c/m gives the desired result.
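The identity in Lemma 7 and its use in Lemma 9 are easy to spot-check numerically. The sketch below is ours, not part of the paper; the parameter choices m = 40 and c = 0.5 are arbitrary, and the second print line simply confirms the algebraic rewriting of q/2 + p/(2(m+2)) stated in Lemma 9.

# Monte Carlo sanity check of Lemma 7, and of the closed form in Lemma 9,
# with q = c/m and p = 1 - c/m.  Purely illustrative.
import random

def expected_sum_sq_gaps(m, trials=20000):
    """Estimate E[sum of squared gaps] for m uniform points in (0,1); Lemma 7 says 2/(m+2)."""
    total = 0.0
    for _ in range(trials):
        xs = sorted([0.0, 1.0] + [random.random() for _ in range(m)])
        total += sum((b - a) ** 2 for a, b in zip(xs, xs[1:]))
    return total / trials

m, c = 40, 0.5
q, p = c / m, 1 - c / m
print(expected_sum_sq_gaps(m), "vs 2/(m+2) =", 2 / (m + 2))
# Lemma 9: Pr(mistake | zero) = q/2 + p/(2(m+2)) = c/(2m) + 1/(2(m+2)) - c/(2m(m+2))
print(q / 2 + p / (2 * (m + 2)), "=", c / (2 * m) + 1 / (2 * (m + 2)) - c / (2 * m * (m + 2)))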

3.2. Event one: exactly one x_i ∈ S is equal to θ

We lower bound the second term on the RHS of (2) by bounding Pr(one) and Pr(mistake | one).

Lemma 10  Pr(one) = mq(1 − q)^{m−1} ≥ c e^{−c} (1 + c/m − c²/m − c³/m²).

Proof:

    Pr(one) = mq(1 − q)^{m−1} = c (1 − c/m)^m / (1 − c/m)
            ≥ c e^{−c} (1 − c²/m) / (1 − c/m)
            ≥ c e^{−c} (1 + c/m − c²/m − c³/m²),

where Lemma 5 was used to bound (1 − c/m)^m and the last inequality used the fact that 1/(1 − z) ≥ 1 + z for all z ∈ [0, 1).

Lemma 11  Pr(mistake | one) ≥ 1/(2m) − O(1/m²).

Proof: Since θ is drawn from the uniform distribution, after conditioning on the event one, the x_i in S are i.i.d. from the uniform distribution on [0, 1]. Again consider the points relabeled in ascending order. Each of the x_i ∈ S is equally likely to be θ, and the x_i = θ is equally likely to be labeled + or −, giving 2m equally likely possibilities. As before, define x_+ and x_− to be the largest x_i ∈ S_+ labeled + and the smallest x_i ∈ S_+ labeled −, respectively.

We proceed assuming the Bayes Optimal Algorithm knows that one of the x_i = θ, i.e. that event one occurred. (This can only reduce its error probability.) To simplify the notation, we leave the conditioning on event one implicit in the remainder of the proof.

If there are both positive and negative examples, so that there is a greatest positive example x_+ and a least negative example x_−, there are two possibilities: either the target is (−∞, x_+] or it is (−∞, x_−). In either case, the entire open interval between x_+ and x_− shares the same label, and since the two cases are equally likely, that label is equally likely to be either + or −. Therefore:

    Pr(mistake) ≥ ∫ Σ_{i=1}^{m−1} Pr(x_i = x_+ | S) Pr(mistake | S, x_+ = x_i) dPr(S)    (8)
                ≥ ∫ Σ_{i=1}^{m−1} Pr(x_i = x_+ | S) · p (x_{i+1} − x_i)/2 dPr(S).    (9)

We consider the sum in more detail. Note that x_i can be x_+ when either x_i = θ and is labeled +, or x_{i+1} = θ and is labeled −, so Pr(x_i = x_+ | S) = 1/m and

    Σ_{i=1}^{m−1} Pr(x_i = x_+ | S) · p (x_{i+1} − x_i)/2 = (p/(2m)) Σ_{i=1}^{m−1} (x_{i+1} − x_i) = (p/(2m)) (x_m − x_1).

Plugging this into (9) gives Pr(mistake) ≥ (p/(2m)) E_{S∼U^m}[x_m − x_1] = (p/(2m)) (1 − 2/(m+1)) = 1/(2m) − O(1/m²), since the expected length of each missing end-interval is 1/(m+1).
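The pieces assembled in Lemmas 5 through 11 can be illustrated with a small simulation. The sketch below is ours, not the paper's: it runs the open-closed experiment with c = 1/2 and measures the average mistake rate of one particular D-oblivious rule, the plain midpoint rule. Since Theorem 1 lower bounds the average error of every algorithm that does not know D under this construction, the measured rate should come out at least about (1/√e)/m for large m.

# Monte Carlo illustration of the Section 3 construction with c = 1/2:
# average mistake rate of the plain midpoint rule (one D-oblivious algorithm),
# compared with the (1/sqrt(e))/m lower bound of Theorem 1.
import math, random

def run(m, c=0.5, trials=100000):
    mistakes = 0
    q = c / m
    for _ in range(trials):
        theta = random.random()
        closed = random.random() < 0.5            # target (-inf,theta] or (-inf,theta)
        def label(x):
            return +1 if (x < theta or (closed and x == theta)) else -1
        pts = [theta if random.random() < q else random.random() for _ in range(m)]
        pos = [x for x in pts if label(x) == +1] + [0.0]
        neg = [x for x in pts if label(x) == -1] + [1.0]
        thr = (max(pos) + min(neg)) / 2           # midpoint rule, ignorant of D
        x = theta if random.random() < q else random.random()
        mistakes += ((+1 if x <= thr else -1) != label(x))
    return mistakes / trials

m = 50
print(run(m), ">= about", (1 / math.sqrt(math.e)) / m)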

3.3. Putting it together

Combining Lemma 5, Lemma 9, Lemma 10, and Lemma 11, we get that Pr(mistake | zero) Pr(zero) + Pr(mistake | one) Pr(one) is at least

    ( c/(2m) + 1/(2(m+2)) − c/(2m(m+2)) ) · e^{−c} (1 − c²/m)
      + ( 1/(2m) − O(1/m²) ) · c e^{−c} (1 + c/m − c²/m − c³/m²)
    = (1 + 2c) e^{−c} / (2m) − O(1/m²).

Setting c = 1/2, the maximizer of (1 + 2c) e^{−c}, the bound becomes 1/(√e · m) − O(1/m²), completing the proof of Theorem 1.

4. Proof of Theorem 2

Here we show that knowledge of D can be exploited by a maximum-margin-in-probability algorithm to achieve prediction error probability at most (1/2 + o(1))/m. The marginal distribution D and target threshold θ are chosen adversarially, but the algorithm is given the distribution D as well as the training sample.

The first step is to show that we can assume without loss of generality that D is the uniform distribution U over (0, 1). [D: I assume there is no problem changing [0, 1] to (0, 1) above] This is a slight generalization of the rescaling trick of Ben-David et al. (2008). [D: replaced modification with generalization; is that OK?]

Lemma 12  opt_D(m) ≤ opt_U(m).

Proof. Our proof is through a prediction-preserving reduction (Pitt and Warmuth, 1990). This consists of a (possibly randomized) instance transformation φ and a target transformation ψ. In this proof φ maps R into [0, 1], and ψ maps a threshold in R to a new threshold in [0, 1]. Implicit in the analysis of Pitt and Warmuth (1990) is the observation that, if φ(x) ≤ ψ(θ) ⟺ x ≤ θ for all training and test examples x, then an algorithm A_t with prediction error bound b_t for the transformed problem can be used to solve the original problem. By feeding A_t the training data φ(x_1), ..., φ(x_m) and the test point φ(x), and using A_t's prediction (of whether φ(x) ≤ ψ(θ) or not), one gets an algorithm with prediction error bound b_t for the original problem.

When D has a density, φ(x) = Pr_{z∼D}(z ≤ x) and ψ(θ) = Pr_{z∼D}(z ≤ θ). In this case the distribution of φ(x) is uniform over (0, 1), and since φ and ψ are identical and monotone, φ(x) ≤ ψ(θ) ⟺ x ≤ θ.

If D has one or more accumulation points, then, for each accumulation point x_a, whenever x = x_a we choose φ(x) uniformly at random from the interval (Pr_{z∼D}(z < x_a), Pr_{z∼D}(z ≤ x_a)]. We still set ψ(θ) = Pr_{z∼D}(z ≤ θ) everywhere. For this transformation, the probability distribution of φ(x) is still uniform over (0, 1), and, as before, if x ≠ θ, or x is not an accumulation point, then φ(x) ≤ ψ(θ) ⟺ x ≤ θ. When x_a = θ for an accumulation point x_a, φ(x_a) ≤ ψ(θ), since ψ(θ) is set to be the right endpoint of (Pr_{z∼D}(z < x_a), Pr_{z∼D}(z ≤ x_a)]. Thus, overall, φ(x) ≤ ψ(θ) ⟺ x ≤ θ, and the probability of a mistake in the original problem is bounded by the mistake probability with respect to the transformed problem.
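The transformation in the proof of Lemma 12 can be made concrete. The following sketch is our own illustration under an extra assumption not needed by the lemma: D is purely atomic and given explicitly as a finite list of point masses (the continuous part is omitted). The function `make_phi` and this representation of D are our choices, and endpoint conventions of the half-open intervals are glossed over.

# Sketch of the rescaling trick behind Lemma 12 for a purely atomic D.
import bisect, random

def make_phi(atoms):
    """atoms: list of (location, probability) pairs summing to 1 (a purely atomic D).
    Returns a randomized map phi with phi(x) uniform on (0,1) when x ~ D, and a
    deterministic psi for thresholds, so that (almost surely)
    phi(x) <= psi(theta) iff x <= theta."""
    locs = sorted(a for a, _ in atoms)
    probs = dict(atoms)
    cdf, running = {}, 0.0
    for a in locs:
        running += probs[a]
        cdf[a] = running                           # Pr_D(z <= a)
    def psi(theta):
        i = bisect.bisect_right(locs, theta)       # Pr_D(z <= theta)
        return cdf[locs[i - 1]] if i > 0 else 0.0
    def phi(x):
        hi = cdf[x]                                # x must be one of the atoms
        lo = hi - probs[x]
        return lo + random.random() * (hi - lo)    # uniform in (Pr(z < x), Pr(z <= x)]
    return phi, psi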

So now we are faced with the subproblem of learning initial intervals in the case that D is the uniform distribution U over (0, 1). The algorithm that we analyze for this problem is the following maximum margin algorithm A_MM: if the training data includes both positive and negative examples, it predicts using a threshold halfway between the greatest positive example and the least negative example. If all of the examples are negative, it uses a threshold halfway between the least negative example and 0, and if all of the examples are positive, A_MM uses a threshold halfway between the greatest positive example and 1.

The basic idea is to exploit the fact that the uniform distribution is invariant to horizontal shifts (modulo 1) and to average over random shifts. We define x ⊕ s to be addition modulo 1, so that, intuitively, x ⊕ s is obtained by starting at x, and walking s units to the right while wrapping around to 0 whenever 1 is reached. We extend the notation to sets in the natural way: if T ⊆ [0, 1) then T ⊕ s = {t ⊕ s : t ∈ T}.

Fix an arbitrary target threshold θ ∈ (0, 1) and also fix (for now) an arbitrary set S = {x_1, ..., x_m} of training points (whose labels are determined by θ). Renumber the points in S so that x_1 ≤ x_2 ≤ ... ≤ x_m. To simplify some expressions, we use both x_1 and x_{m+1} to refer to x_1.

For each x, s ∈ [0, 1], let error(s, x) be the {0, 1}-valued indicator for whether A_MM makes a mistake when trained with S ⊕ s and tested with x. Let err_s = E_{x∼U}(error(s, x)), and let error = E_{s∼U}(err_s). For 1 ≤ i < m, let G_i = [x_i, x_{i+1}) be the points in the interval between x_i and x_{i+1}, and let G_m = [x_m, 1) ∪ [0, x_1), so the G_i partition [0, 1). Let R_i = {s ∈ [0, 1) : θ ∈ G_i ⊕ s} and notice that the R_i also partition [0, 1). We have

    error = ∫_0^1 err_s ds = Σ_i ∫_{s∈R_i} err_s ds.

We now consider ∫_{s∈R_i} err_s ds in more detail. This integral corresponds to the situation where the shifted sample causes θ to fall in the gap between x_i and x_{i+1}. Depending on the location of θ and the length of the gap, the shifted interval might extend past either 0 or 1 while containing θ, and thus wrap around to the other side of the unit interval. Let the gap length g_i be x_{i+1} − x_i (or 1 − x_m + x_1 if i = m), and let r_i be the shift taking x_{i+1} to θ, so x_{i+1} ⊕ r_i = θ. Note that a shift of r_i + g_i takes x_i to θ, even though r_i + g_i might be greater than one, which happens when θ ∈ [x_i, x_{i+1}).

We will now give a proof-by-plot that ∫_{s∈R_i} err_s ds ≤ g_i²/4. For an algebraic proof, see Appendix A. [D: Case C does not yet have an algebraic proof.]

We begin by assuming θ ≤ 1/2, since the situation with θ > 1/2 is symmetrical. The cases we consider depend on the relationship between g_i and θ: case (A) g_i ≤ θ, case (B) θ < g_i ≤ 1 − θ, and case (C) 1 − θ < g_i. Cases (B) and (C) have two sub-cases depending on whether or not θ ≤ g_i/2. The figure for each case shows several shiftings of the interval, and plots err_s as a function of s. The predictions of A_MM on the shifted interval are indicated, and the region where A_MM makes a prediction error is shaded. The top shifting in each case is actually for s = r_i plus some small ε (so that θ < x_{i+1} ⊕ s), although it is labeled as s = r_i for simplicity. Note that in each case, the plot of err_s lies within two triangles with height g_i/2. Therefore the integrals ∫_{r_i}^{r_i+g_i} err_s ds are at most g_i²/4.
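Before working through the cases, here is a brief numerical illustration (ours, not part of the argument): simulating A_MM with D = U and a uniformly drawn θ gives a mistake rate consistent with the 1/(2(m+1)) bound derived at the end of this section. The value m = 30 is an arbitrary choice.

# Monte Carlo check that A_MM's mistake probability under the uniform
# distribution is no larger than about 1/(2(m+1)).
import random

def a_mm_threshold(sample):
    pos = [x for x, y in sample if y == +1]
    neg = [x for x, y in sample if y == -1]
    lo = max(pos) if pos else 0.0
    hi = min(neg) if neg else 1.0
    return (lo + hi) / 2

def mm_error(m, trials=200000):
    errs = 0
    for _ in range(trials):
        theta = random.random()
        pts = [random.random() for _ in range(m)]
        sample = [(x, +1 if x <= theta else -1) for x in pts]
        thr = a_mm_threshold(sample)
        x = random.random()
        errs += ((x <= thr) != (x <= theta))
    return errs / trials

m = 30
print(mm_error(m), "vs 1/(2(m+1)) =", 1 / (2 * (m + 1)))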

Case (A): g_i ≤ θ. The [x_i, x_{i+1}) ⊕ s interval intersects θ only when s + x_i ≤ θ < s + x_{i+1}.

[Figure: several shiftings of the interval for Case (A), with the plot of err_s over s ∈ [r_i, r_i + g_i]; the plot is a "V" formed by two triangles of height g_i/2, dipping to 0 at s = r_i + g_i/2.]

Case (B), subcase (1): g_i + θ ≤ 1 and θ < g_i/2. Note that the dashed part of the interval is actually shifted to the right edge of [0, 1).

[Figure: shiftings of the interval and the plot of err_s for Case (B), subcase (1); the plot starts at height θ/2, dips to 0 at s = r_i + θ, and stays within the two triangles of height g_i/2.]

Case (B), subcase (2): g_i + θ ≤ 1 and g_i/2 < θ < g_i. Again, the dashed part of the interval is actually shifted to the right edge of [0, 1).

[Figure: shiftings of the interval and the plot of err_s for Case (B), subcase (2).]

Case (C), subcase (1): θ < g_i/2, and θ + g_i > 1. On the left: shifting the interval between x_i and x_{i+1} to intersect θ. In this case not only can part of the interval be shifted to the right edge of [0, 1), but part of the interval can also extend beyond 1 (and be shifted to the left edge of [0, 1)) while the shifted interval contains θ.

[Figure: shiftings of the interval and the plot of err_s for Case (C), subcase (1); the plot has a bump at s = r_i + 1 − θ, where x_{i+1} wraps around.]

The correctness of the plot relies on the bump in the plot at s = r_i + 1 − θ never rising outside of the triangle. We have redrawn the triangle in more detail to the right (dropping the subscripts) and relabeled the x-axis. The redrawn figure gives the height of the dotted boundary of the triangle at the bump and the height of the bump itself, both as explicit functions of g and θ. Setting these two heights equal and solving for g yields g = 2θ and g = 1, exactly the boundaries of this case. Therefore the difference between them (which is the amount by which the boundary of the triangle lies above the bump) does not change sign within the case. When g = 5/6 and θ = 2/6 the difference is 1/36, so the difference remains non-negative for all g_i and θ in this case.

Case (C), subcase (2): θ > g_i/2, and θ + g_i > 1. On the left: shifting the interval between x_i and x_{i+1} to intersect θ. Again, parts of the interval can be shifted across 0 and across 1 while the shifted interval contains θ.

[Figure: shiftings of the interval and the plot of err_s for Case (C), subcase (2).]

We now have, for any θ ∈ [0, 1],

    Pr_{S∼U^m, x∼U}(A_MM incorrect) = E_{S∼U^m} E_{s∼U, x∼U}( A_MM(S ⊕ s, x) ≠ 1_{x ≤ θ} )
                                     ≤ E_{S∼U^m} [ Σ_i g_i²/4 ]
                                     ≤ 1/(2(m+1)),

where the last inequality uses Lemma 8. This completes the proof of Theorem 2.

5. Proof of Theorem 3

In this section we show that any maximum margin algorithm can be forced to have an error probability (1 − o(1))/m when it is not given knowledge of D (i.e. without transforming the input as in the previous section). This is a factor of 2 worse than our upper bound for an algorithm that uses knowledge of D to maximize a probability-weighted margin.

Let c be a positive even integer (by choosing a large constant value for c, our lower bound will get arbitrarily close to 1/m). For a given training set size m, consider the set T = {3^{−cm}, ..., 3^{−2}, 3^{−1}} containing the first cm powers of 1/3. Fix the distribution D to be the uniform distribution over T. Fix the target threshold to be θ = 3^{−cm/2−1}, so that half the points in T are labeled positively (recall that all points less than or equal to the threshold are labeled positively).

Maximum margin algorithms can make different predictions only when all the examples have the same label. For this choice of a distribution D and a target θ, the probability that all the examples have the same label is 2/2^m. Thus for large enough m, the difference between maximum margin algorithms is negligible. From here on, let us consider the algorithm A_MM, defined earlier, that adds two artificial examples, (0, +) and (1, −), and predicts using the maximum margin classifier on the resulting input.

For 1 ≤ i ≤ cm/2, let T_i be the i points in T just above the threshold: if l = cm/2 + 1 = −log_3 θ then T_i = {3^{−l+1}, 3^{−l+2}, ..., 3^{−l+i}}. Let event Miss_i be the event that none of the training points are in T_i. For i < cm/2, let event Exact_i be the event that both (a) none of the training points are in T_i, and (b) some training point is in T_{i+1} (i.e. some training point is 3^{−l+i+1}). Let Exact_{cm/2} be the event that no training point is labeled −. Therefore the Exact_i events are disjoint and Miss_i = ∪_{j≥i} Exact_j.

Note that if Exact_i occurs, then the smallest negative example is 3^{−l+i+1}. Furthermore, all points in T_i are less than half this value and the maximum margin algorithm predicts incorrectly on exactly the i points in T_i, so Pr(error | Exact_i) = i/(cm). Thus, for m > c, we have

    Pr(error) = Σ_{i=1}^{cm/2} Pr(error | Exact_i) Pr(Exact_i)
              = (1/(cm)) Σ_{i=1}^{cm/2} i Pr(Exact_i)
              = (1/(cm)) Σ_{i=1}^{cm/2} Pr(Miss_i)
              = (1/(cm)) Σ_{i=1}^{cm/2} ((cm − i)/(cm))^m
              = (1/(cm)) Σ_{i=1}^{cm/2} (1 − i/(cm))^m.

For each fixed i, in the limit as m → ∞, (1 − i/(cm))^m → exp(−i/c), so for m large enough (large relative to the constant c) we can continue as follows:

    Pr(error) ≥ ((1 − ε_1)/(cm)) Σ_{i=1}^{cm/2} exp(−i/c)
              = ((1 − ε_1)/(cm)) · exp(−1/c) (1 − exp(−m/2)) / (1 − exp(−1/c))
              = ((1 − ε_1)(1 − ε_2)/(cm)) · exp(−1/c) / (1 − exp(−1/c)),

where ε_1 → 0 as m → ∞ (for fixed c) and ε_2 = e^{−m/2}. Now, replacing 1/c by a we get:

    Pr(error) ≥ ((1 − ε_1)(1 − ε_2)/m) · a exp(−a) / (1 − exp(−a)).

Using L'Hôpital's rule, we see that the limit of the second fraction as a → 0 is 1. So for large enough c, the second fraction is at least 1 − ε_3 and Pr(error) ≥ (1 − ε_1)(1 − ε_2)(1 − ε_3)/m. Thus, by making the constant c large enough, and choosing m large enough compared to c, the expected error of the maximum margin algorithm can be made arbitrarily close to 1/m.

6. Conclusion

Algorithms that know the underlying marginal distribution D over the instances can learn significantly more accurately than algorithms that do not. Since knowledge of D has been proposed as a proxy for a large number of unlabeled examples, our results indicate a benefit for semi-supervised learning. It is particularly intriguing that our analysis shows the benefit of semi-supervised learning when the distribution is nearly uniform, but slightly concentrated near the decision boundary. This is in sharp contrast to previous analyses showing the benefits of semi-supervised learning, which typically rely on a cluster assumption postulating that examples are sparse along the decision boundary.
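The construction in the proof of Theorem 3 is easy to simulate. The sketch below is ours and is illustrative only; the particular values m = 40 and c = 4 are arbitrary. With these values, the calculation above predicts an error rate of roughly (1/(cm)) Σ_i e^{−i/c} ≈ 1/m − 1/(2cm), noticeably larger than the 1/(2m)-scale error available to an algorithm that knows D.

# Monte Carlo illustration of the Theorem 3 construction:
# D uniform over the first c*m powers of 1/3, threshold 3^{-(cm/2+1)}.
import random

def mm_error_on_powers(m, c, trials=200000):
    n = c * m                        # |T|; the points are 3^{-1}, ..., 3^{-n}
    theta = 3.0 ** (-(n // 2 + 1))   # half of T labeled +
    errs = 0
    for _ in range(trials):
        pts = [3.0 ** (-random.randint(1, n)) for _ in range(m)]
        pos = [x for x in pts if x <= theta] + [0.0]   # artificial example (0, +)
        neg = [x for x in pts if x > theta] + [1.0]    # artificial example (1, -)
        thr = (max(pos) + min(neg)) / 2
        x = 3.0 ** (-random.randint(1, n))
        errs += ((x <= thr) != (x <= theta))
    return errs / trials

m, c = 40, 4
print(mm_error_on_powers(m, c), "vs 1/m =", 1 / m, "and 1/(2m) =", 1 / (2 * m))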

References

M. Balcan and A. Blum. A discriminative model for semi-supervised learning. JACM, 57(3), 2010.

S. Ben-David, T. Lu, and D. Pal. Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. COLT, 2008.

B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. Proceedings of the 1992 Workshop on Computational Learning Theory, pages 144–152, 1992.

Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. MIT Press, 2006.

Malte Darnstädt and Hans-Ulrich Simon. Smart PAC-learners. Theor. Comput. Sci., 412(19):1756–1766, 2011.

A. Ehrenfeucht, D. Haussler, M. Kearns, and L. G. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3), 1989.

D. Haussler, N. Littlestone, and M. K. Warmuth. Predicting {0,1}-functions on randomly drawn points. Information and Computation, 115(2), 1994.

R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines. Journal of Machine Learning Research, 1:245–279, 2001.

M. Kääriäinen. Generalization error bounds using unlabeled data. COLT, 2005.

Y. Li, P. M. Long, and A. Srinivasan. The one-inclusion graph algorithm is near-optimal for the prediction model of learning. IEEE Transactions on Information Theory, 47(3):1257–1261, 2001.

P. A. P. Moran. The random division of an interval. Supplement to the Journal of the Royal Statistical Society, 9(1):92–98, 1947.

L. Pitt and M. K. Warmuth. Prediction preserving reducibility. Journal of Computer and System Sciences, 41(3), 1990.

R. Urner, S. Shalev-Shwartz, and S. Ben-David. Access to unlabeled data can speed up prediction time. ICML, 2011.

V. N. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 1963.

Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool, 2009.

Appendix A. Algebraic completion of the proof of Theorem 2

Here we give an algebraic proof that ∫_{s∈R_i} err_s ds ≤ g_i²/4, which was sketched using plots in Section 4. This integral corresponds to the situation where the shifted sample causes θ to fall in the gap between x_i and x_{i+1}. We again assume that θ ≤ 1/2 (the other case is symmetrical) and proceed using the same cases. As before, let r_i be the shift taking x_{i+1} to θ and g_i be the length of the (x_i, x_{i+1}) gap. Thus x_{i+1} ⊕ r_i = θ, and a shift of r_i + g_i takes x_i to θ even though r_i + g_i might be greater than one.

Case (A): g_i ≤ θ. In this case, θ falls in the gap only for shifts s where x_i ⊕ s ≤ θ < x_{i+1} ⊕ s. The maximum margin algorithm makes a mistake on a randomly drawn test point exactly when the test point is between the middle of the (shifted) gap and θ. Therefore,

    ∫_{s∈R_i} err_s ds = ∫_{r_i}^{r_i+g_i} | θ − ( (x_i + x_{i+1})/2 ⊕ s ) | ds = ∫_{−g_i/2}^{g_i/2} |z| dz = g_i²/4.

Case (B): θ ≤ g_i ≤ 1 − θ. In this case, as before, the integral goes from r_i to r_i + g_i. However, the expected error is slightly more complicated: while s ∈ [r_i, r_i + g_i − θ], all of the examples are negative, and the expected error is |θ − (x_{i+1} ⊕ s)/2|, and when s ∈ [r_i + g_i − θ, r_i + g_i] the expected error is |θ − (x_{i+1} ⊕ s + x_i ⊕ s)/2|; see the Case (B) plots in Section 4. Thus:

    ∫_{s∈R_i} err_s ds = ∫_{r_i}^{r_i+g_i−θ} | θ − (x_{i+1} ⊕ s)/2 | ds + ∫_{r_i+g_i−θ}^{r_i+g_i} | θ − (x_{i+1} ⊕ s + x_i ⊕ s)/2 | ds.

Using a change of variables (t = s − r_i), we get

    ∫_{s∈R_i} err_s ds = ∫_0^{g_i−θ} | θ − (θ + t)/2 | dt + ∫_{g_i−θ}^{g_i} | θ − (θ + t + θ + t − g_i)/2 | dt
                      = ∫_0^{g_i−θ} |θ − t|/2 dt + ∫_{g_i−θ}^{g_i} |t − g_i/2| dt.    (10)

Subcase (B1): g_i ≥ 2θ. Continuing from Equation (10),

    ∫_{s∈R_i} err_s ds = ∫_0^{θ} (θ − t)/2 dt + ∫_{θ}^{g_i−θ} (t − θ)/2 dt + ∫_{g_i−θ}^{g_i} (t − g_i/2) dt
                      = θ²/4 + (g_i − 2θ)²/4 + θ(g_i − θ)/2
                      = g_i²/4 − g_iθ/2 + 3θ²/4
                      ≤ g_i²/4,

since θ ≤ g_i/2 implies 3θ²/4 ≤ 3g_iθ/8 ≤ g_iθ/2.

Subcase (B2): θ ≤ g_i < 2θ. Again continuing from Equation (10),

    ∫_{s∈R_i} err_s ds = ∫_0^{g_i−θ} (θ − t)/2 dt + ∫_{g_i−θ}^{g_i/2} (g_i/2 − t) dt + ∫_{g_i/2}^{g_i} (t − g_i/2) dt
                      = θ(g_i − θ)/2 − (g_i − θ)²/4 + (θ − g_i/2)²/2 + g_i²/8
                      = θ g_i/2 − θ²/4
                      ≤ g_i²/4,

since θg_i/2 − θ²/4, as a function of θ, is nondecreasing on the interval (g_i/2, g_i), and therefore at most g_i²/4 over that interval.

Case (C): g_i ≥ 1 − θ. When s ∈ [r_i, r_i + g_i − θ], all of the examples are negative (as in case (B)), and when s ∈ (r_i + 1 − θ, r_i + g_i) all the examples are positive. This partitions the shifts in (r_i, r_i + g_i) into three parts (see the plots in Section 4). Initially θ falls in the gap between 0 and the shifted x_{i+1}. Then x_i shifts in and θ is in the gap between the shifted x_i and x_{i+1}. Finally, x_{i+1} wraps around and θ is in the gap between the shifted x_i and 1. Thus ∫_{r_i}^{r_i+g_i} err_s ds equals

    ∫_{r_i}^{r_i+g_i−θ} | θ − (x_{i+1} ⊕ s)/2 | ds + ∫_{r_i+g_i−θ}^{r_i+1−θ} | θ − (x_{i+1} ⊕ s + x_i ⊕ s)/2 | ds + ∫_{r_i+1−θ}^{r_i+g_i} | θ − (x_i ⊕ s + 1)/2 | ds.

Using the substitution t = s − r_i and following case (B), this becomes

    ∫_{s∈R_i} err_s ds = ∫_0^{g_i−θ} |θ − t|/2 dt + ∫_{g_i−θ}^{1−θ} |t − g_i/2| dt + ∫_{1−θ}^{g_i} | θ − (θ − g_i + t + 1)/2 | dt.    (11)

Subcase (C1): g_i ≥ 2θ, so 1 − g_i ≤ θ ≤ g_i/2. Continuing from (11), ∫_{s∈R_i} err_s ds equals

    ∫_0^{θ} (θ − t)/2 dt + ∫_{θ}^{g_i−θ} (t − θ)/2 dt + ∫_{g_i−θ}^{1−θ} (t − g_i/2) dt + ∫_{1−θ}^{g_i} (t + 1 − θ − g_i)/2 dt
    = θ²/4 + (g_i − 2θ)²/4 + (1 − g_i)(1 − 2θ)/2 + (1/4)[ (1 − θ)² − (2 − 2θ − g_i)² ]
    = g_i/2 + θ²/2 + θ/2 − g_iθ − 1/4.

Note that the second derivative w.r.t. θ is positive, so the r.h.s. is maximized when θ = 1 − g_i or θ = g_i/2. In both cases, it is easily verified that the value is at most g_i²/4.

Subcase (C2): g_i ≤ 2θ. Continuing from (11) and following the logic of case (B2), ∫_{s∈R_i} err_s ds equals

    ∫_0^{g_i−θ} (θ − t)/2 dt + ∫_{g_i−θ}^{g_i/2} (g_i/2 − t) dt + ∫_{g_i/2}^{1−θ} (t − g_i/2) dt + ∫_{1−θ}^{g_i} (t + 1 − θ − g_i)/2 dt
    = (3θ − g_i)(g_i − θ)/4 + (θ − g_i/2)²/2 + (1 − θ − g_i/2)²/2 + (g_i + θ − 1)(3 − g_i − 3θ)/4
    = g_i/2 + θ/2 − θ²/2 − g_i²/4 − 1/4.

This is increasing in θ (recall θ ≤ 1/2), and thus maximized when θ = 1/2, where it becomes g_i/2 − g_i²/4 − 1/8. We want to show that this bound is at most g_i²/4, i.e. that f(g_i) = g_i/2 − g_i²/2 − 1/8 ≤ 0. Combining the facts that g_i ≥ 1/2 in Case (C), f(1/2) = 0, and f′(g_i) ≤ 0 when g_i ≥ 1/2 gives the desired inequality.
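As a final sanity check (ours, not the paper's), the inequality ∫_{s∈R_i} err_s ds ≤ g_i²/4 can be verified numerically. While θ sits in the gap, A_MM's threshold depends only on the two shifted gap endpoints together with the artificial examples at 0 and 1, so err_s can be computed from those endpoints alone and integrated with a Riemann sum.

# Numeric spot-check of the Appendix A inequality over random theta and gap lengths.
import random

def err(t, theta, g):
    """Expected error of A_MM at shift r_i + t, for the gap of length g ending at theta."""
    a, b = (theta + t - g) % 1.0, (theta + t) % 1.0   # shifted x_i and x_{i+1}
    pos = [p for p in (a, b) if p <= theta] + [0.0]
    neg = [p for p in (a, b) if p > theta] + [1.0]
    thr = (max(pos) + min(neg)) / 2
    return abs(theta - thr)   # mass of test points lying between theta and the threshold

def check(trials=1000, steps=2000):
    for _ in range(trials):
        theta = random.uniform(0.0, 0.5)
        g = random.uniform(0.0, 1.0)
        integral = sum(err((k + 0.5) / steps * g, theta, g) for k in range(steps)) * g / steps
        assert integral <= g * g / 4 + 1e-6, (theta, g, integral)
    print("all sampled cases satisfied the g^2/4 bound")

check()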

New Bounds for Intervals since θg i, as a function of θ, is nondecreasing on the interval (g i /, g i ), and therefore at ost gi over that interval. Case (C): g i 1. When s [, + g i ], all of the exaples are negative (as in case (B)) and when s ( + 1, + g i ) all the exaples are positive. This partitions the shifts in (, + g i ) into three parts (see the plots in Section ). Initially θ falls in the gap between and the shifted x i+1. Then x i shifts in and the θ is in the gap between the shifted x i and x i+1. Finally, x i+1 wraps around and θ is in the gap between the shifted x i and 1. Thus +g i ds equals θ x i+1 s ri +1 θ ds+ +g i θ x i+1 s + x i s ds+ +1 θ x i s + 1 ds Using the substitution t = s and following case B this becoes gi θ ds = t 1 θ dt+ t g gi i dt+ θ g i + t + 1 g i θ 1 θ dt (11) Subcase (C1): g i θ, so 1 g i θ g i /. Continuing fro (11), +g i ds equals θ θ t gi θ t dt + θ 1 θ dt + t g gi i g i θ dt + t + 1 g i 1 θ = θ + (g i ) + (1 g i)(1 ) + gi(1 ) gi = g + θ + θ gθ 1/. dt 3(1 ) Note that the second derivative w.r.t. θ is positive, so the r.h.s. is axiized when θ = 1 g i or θ = g i /. In both cases, it is easily verified that the value is at ost gi /. Subcase (C): g i θ. Continuing fro (11) and following the logic of case B, gi θ ds equals θ t gi / g 1 θ i dt + g i θ t dt + t g gi i g i / dt + t + 1 g i 1 θ = (3θ g i)(g i ) + (θ g i) 8 = g i + θ 1 gi. + ( g i ) 8 dt + (g i + θ 1)(3 g i 3θ) This is increasing in θ, and thus axiized when θ = 1/ where it becoes g i / gi / 1/8. We want to show that this bound is at ost gi /, i.e. that f(g i) = g i / gi / 1/8. Cobining the facts that g i 1/ in Case C, f(1/) =, and f (g i ) when g i 1/, gives the desired inequality. 15