New Bounds for Learning Intervals with Implications for Semi-Supervised Learning

JMLR: Workshop and Conference Proceedings vol (2012) 1–15

New Bounds for Learning Intervals with Implications for Semi-Supervised Learning

David P. Helmbold  dph@soe.ucsc.edu
Department of Computer Science, University of California at Santa Cruz

Philip M. Long  plong@google.com
Google

Abstract
We study learning of initial intervals in the prediction model. We show that for each distribution D over the domain, there is an algorithm A_D whose probability of a mistake in round m is at most (1/2 + o(1)) 1/m. We also show that the best possible bound that can be achieved in the case in which the same algorithm A must be applied for all distributions D is at least (1/√e − o(1)) 1/m > (3/5 − o(1)) 1/m. Informally, knowing the distribution D enables an algorithm to reduce its error rate by a constant factor strictly greater than 1.2. As advocated by Ben-David et al. (2008), knowledge of D can be viewed as an idealized proxy for a large number of unlabeled examples.
Keywords: Prediction model, initial intervals, semi-supervised learning, error bounds.

1. Introduction

Where to place a decision boundary between a cloud of negative examples and a cloud of positive examples is a core and fundamental issue in machine learning. Learning theory provides some guidance on this question, but gaps in our knowledge persist even in the most basic and idealized formalizations of this problem.

Arguably the most basic such formalization is the learning of initial intervals in the prediction model (Haussler et al., 1994). Each concept in the class is described by a threshold θ, and an instance x ∈ R is labeled + if x ≤ θ and − otherwise. The learning algorithm A is given a labeled m-sample {(x_1, y_1), ..., (x_m, y_m)} where y_i is the label of x_i. The algorithm must then predict a label ŷ for a test point x. The x_1, ..., x_m and x are drawn independently at random from an arbitrary, unknown probability distribution D. Let opt(m) be the optimal error probability guarantee using m examples in this model. The best previously known bounds (Haussler et al., 1994) were

    (1/2 − o(1)) 1/m ≤ opt(m) ≤ (1 + o(1)) 1/m.    (1)

To our knowledge, this factor of 2 gap has persisted for nearly two decades.

The proof of the lower bound of (1) uses a specific choice of D, no matter what the target. Thus, it also lower bounds the best possible error probability guarantee for algorithms that are given the distribution D as well as the sample. The upper bound holds for a particular algorithm and all distributions D, so it is also an upper bound on the best possible error probability guarantee when the algorithm does not know D.

© 2012 D.P. Helmbold & P.M. Long.

The increasing availability of unlabeled data has inspired much recent research on the question of how to use such data, and the limits on its usefulness. Ben-David et al. (2008) proposed knowledge of the distribution D as a clean, if idealized, proxy for access to large numbers of unlabeled examples. Since the lower bound in (1) did not exploit the algorithm's lack of a priori knowledge of D, improved lower bounds exploiting this lack of knowledge may also shed light on the utility of unlabeled data.

This paper studies how knowledge of D affects the error rate when learning initial intervals in the prediction model. Our positive result is an algorithm that, when given D, has an error rate at most (1/2 + o(1)) 1/m. This matches the lower bound of (1) up to lower order terms. As a complementary negative result, we show that any prediction algorithm without prior knowledge of D can be forced to have an error probability at least (1/√e − o(1)) 1/m > (3/5 − o(1)) 1/m. Thus not knowing D leads to at least a 20% increase in the probability of making a mistake (when the target and distribution are chosen adversarially). A third result shows that the maximum margin algorithm can be forced to have the even higher error rate (1 − o(1)) 1/m.

The training data reduces the version space (the region of potential values of θ) to an interval between the greatest positive example and the least negative example. Furthermore, all examples outside this region are classified correctly by any θ in the version space. Our algorithm achieving the (1/2 + o(1)) 1/m error probability protects against the worst case by choosing a hypothesis θ̂ in the middle (with respect to the distribution D) of this region of uncertainty. This can be viewed as a D-weighted halving algorithm and follows the general principle of getting into the middle of the version space (Herbrich et al., 2001; Kääriäinen, 2005).

As has become common since (Ehrenfeucht et al., 1989), our (1/√e − o(1)) 1/m lower bound proceeds by choosing θ and D randomly and analyzing the error rate of the resulting Bayes optimal algorithm. The distribution D in our construction concentrates a moderate amount q of probability very close to one side or the other of the decision boundary. If D is unknown, and no examples are seen from the accumulation point (likely if q is not too large), the algorithm cannot reliably get into the middle of the version space. Unlike most analyses showing the benefits of semi-supervised learning that rely on an assumption of sparsity near the decision boundary, our analysis uses distributions that are peaked at the decision boundary.

Related work. Learning from a labeled sample and additional unlabeled examples is called semi-supervised learning. Semi-supervised learning is an active and diverse research area; see standard texts like (Chapelle et al., 2006) and (Zhu and Goldberg, 2009) for more information.

Our work builds most directly on the work of Ben-David et al. (2008). They proposed using knowledge of D as a proxy for access to a very large number of unlabeled examples, and considered how this affects the sample complexity of learning in the PAC model, which is closely related to the error rate in the prediction model (Haussler et al., 1994). Their main results concerned limitations on the impact of the knowledge of D; Darnstädt and Simon (2011) extended this line of research, also demonstrating such limitations. In contrast, the thrust of our main result is the opposite, that knowledge of D gives at least a 16% reduction in the (worst-case) prediction error rate. Balcan and Blum (2010) introduced a framework

to analyze cases in which partial knowledge of the relationship between the distribution and the target classifier can improve the error rate, whereas the focus of this paper is to study the benefits of knowing D even potentially in the absence of a relationship between D and the target. Kääriäinen (2005) analyzed algorithms that use unlabeled data to estimate the metric ρ_D(f, g) := Pr_{x∼D}(f(x) ≠ g(x)), and then choose a hypothesis at the center of the version space of classifiers agreeing with the labels on the training examples. Our algorithm for learning given knowledge of D follows this philosophy. Urner et al. (2011) proposed a framework to analyze cases in which unlabeled data can help to train a classifier that can be evaluated more efficiently.

The upper bound of (1) is a consequence of a more general bound in terms of the VC-dimension. Li et al. (2001) showed that the leading constant in the general bound cannot be improved, even in the case that the VC-dimension is 1. However, their construction uses tree-structured classes that are more complicated than the initial intervals studied here.

2. Further Preliminaries and Main Results

For any particular D and θ, the expected error of Algorithm A, Err(A; D, θ), is the probability that its prediction is not the correct label of x, where the probability is over the m + 1 random draws from D and any randomization performed by A. We are interested in the worst-case error of the best algorithm: if the algorithm can depend on D, this is

    opt_D(m) = inf_A sup_θ Err(A; D, θ),

and, if not,

    opt(m) = inf_A sup_{D,θ} Err(A; D, θ).

Our main lower bound is the following.

Theorem 1  opt(m) ≥ (1/√e − o(1)) 1/m > (3/5 − o(1)) 1/m.

This means that for every algorithm learning initial intervals, there is a distribution D and threshold θ such that the algorithm's mistake probability (after seeing m examples, but not the distribution D) is at least (1/√e − o(1)) 1/m. In our proof, distribution D depends on m.

Our main upper bound is the following.

Theorem 2  For all probability distributions D, opt_D(m) ≤ (1/2 + o(1)) 1/m.

We show Theorem 2 by analyzing an algorithm that gets into the middle of the version space with respect to the given distribution D. When there are both positive and negative examples, a maximum margin algorithm (Vapnik and Lerner, 1963; Boser et al., 1992) makes its prediction using a hypothesized threshold θ̂ that is halfway between the greatest positive example and the least negative example.

Theorem 3  For any maximum margin algorithm A, there is a D and a θ such that Err(A; D, θ) ≥ (1 − o(1)) 1/m.
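To make the two kinds of predictors above concrete, here is a small illustrative sketch (ours, not code from the paper; the function names, the NumPy usage, and the representation of D through a CDF and its inverse are our own assumptions) of the plain maximum margin midpoint and of a D-weighted midpoint that splits the probability mass of the version space in half.

```python
# Illustrative sketch only (not from the paper): two threshold predictors for
# samples in [0, 1].  `cdf` and `inv_cdf` describe the known distribution D.
import numpy as np

def max_margin_threshold(x, y):
    """Midpoint between greatest positive and least negative example.
    Artificial examples (0, +) and (1, -) are added, as in the algorithm
    A_MM analyzed later in the paper."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=bool)          # True = positive label
    lo = x[y].max(initial=0.0)             # greatest positive example (or 0)
    hi = x[~y].min(initial=1.0)            # least negative example (or 1)
    return (lo + hi) / 2

def d_weighted_threshold(x, y, cdf, inv_cdf):
    """Split the D-mass of the version space (lo, hi) in half."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=bool)
    lo = x[y].max(initial=0.0)
    hi = x[~y].min(initial=1.0)
    return inv_cdf((cdf(lo) + cdf(hi)) / 2)

# Tiny usage example with D = uniform on (0, 1), where the two coincide.
rng = np.random.default_rng(0)
theta = 0.37
xs = rng.random(20)
ys = xs <= theta
print(max_margin_threshold(xs, ys),
      d_weighted_threshold(xs, ys, lambda u: u, lambda u: u))
```

When D is not uniform, the two predictors generally differ, and the results above quantify the worst-case consequences of that difference.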

Our construction chooses D and θ as functions of m. Note that in the case of intervals, the maximum margin algorithm is similar to the (un-weighted) halving algorithm.

Throughout we use U to denote the uniform distribution on the open interval (0, 1).

3. Proof of Theorem 1

As mentioned in the introduction, we will choose the target θ and the distribution D randomly, and prove a lower bound on the Bayes optimal algorithm when D and θ are chosen in this way. No algorithm A can do better on average over the random choice of θ and D than the Bayes optimal algorithm that knows the distributions over D and θ. This in turn implies that for any algorithm A there exist a particular θ and D for which the lower bound holds for A.

Let q = c/m for a c ∈ [0, m) to be chosen later. Define distribution D_{θ,q} as the following mixture:
- with probability p = 1 − q, x is drawn from U, the uniform distribution on (0, 1);
- with probability q = c/m, x = θ.

We will analyze the following:
1. Fix the sample size m.
2. Draw θ from U and set the target to be (−∞, θ] with probability 1/2, and (−∞, θ) with probability 1/2.
3. Draw an m-sample S = {x_1, ..., x_m} from D_{θ,q}. Extend S to the extended sample S⁺ by adding x_0 = 0 and x_{m+1} = 1 to S. The labeled sample is L = {(x_i, y_i)} where x_i ∈ S⁺ and y_i is the label of x_i given by the target (either (−∞, θ) or (−∞, θ]).
4. Draw a final test point x, also i.i.d. from D_{θ,q}.

This setting, which we call the open-closed experiment, is not legal, because it sometimes uses open intervals as targets. However, we will now show that a lower bound for this setting implies a similar lower bound when only closed initial intervals are used. We define the legal experiment as above, except that instead of using the open target (−∞, θ), the adversary uses the target (−∞, θ − 3^{−m}].

Lemma 4  For any algorithm A and any number m of examples, let p_legal be the probability that A makes a mistake in the legal experiment, and p_oc be the probability that A makes a mistake in the open-closed experiment. Then p_legal ≥ p_oc − (m + 1)3^{−m}.

Proof. If none of the training or test examples falls (strictly) between θ − 3^{−m} and θ, then the training and test data are the same in both experiments. The probability that this happens is at least 1 − (m + 1)3^{−m}.

Since a (C − o(1))/m lower bound for the open-closed experiment implies such a bound for the legal experiment, we can concentrate on the open-closed experiment.

We now consider the following events.
- mistake is the event that the Bayes optimal classifier predicts incorrectly.
- zero is the event that no x_i ∈ S equals θ.
- one is the event that exactly one x_i ∈ S equals θ.
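Before bounding these quantities, here is a quick numerical sanity check (ours, not part of the proof; the parameter values are illustrative) of the probabilities of the events zero and one under the construction above, compared with the leading terms e^{−c} and c·e^{−c} that appear in Lemmas 5 and 10 below.

```python
# Quick sanity check (ours): estimate Pr(zero) and Pr(one) when each of the m
# draws independently hits the atom at theta with probability q = c/m.
import numpy as np

rng = np.random.default_rng(1)
m, c, trials = 200, 0.5, 200_000
q = c / m

hits = rng.random((trials, m)) < q          # which draws hit the atom at theta
atom_counts = hits.sum(axis=1)
print("Pr(zero):", (atom_counts == 0).mean(), "vs e^{-c} =", np.exp(-c))
print("Pr(one): ", (atom_counts == 1).mean(), "vs c e^{-c} =", c * np.exp(-c))
```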

Now, using these events,

    Pr(mistake) ≥ Pr(zero) Pr(mistake | zero) + Pr(one) Pr(mistake | one).    (2)

3.1. Event zero, no x_i ∈ S is equal to θ

First we lower bound the probability of zero.

Lemma 5  Pr(zero) = (1 − c/m)^m ≥ e^{−c}(1 − c²/m).

Proof: The draws from D_{θ,q} are independent, so for q = c/m we have Pr(zero) = p^m = (1 − c/m)^m. Furthermore, ln((1 − c/m)^m) = m ln(1 − c/m) ≥ m(−c/m − c²/m²) = −c − c²/m, and exp(−c²/m) ≥ 1 − c²/m.

Now, we lower bound the probability of a mistake, given zero. In this case, the x_i's in S and θ are all drawn i.i.d. from the uniform distribution on (0, 1), and with probability one the x_i are all different from θ and each other. Since all permutations are equally likely, with probability 1 − 2/(m + 1) there is an x_i ∈ S smaller than θ and an x_j ∈ S greater than θ. Let x_+ be the largest x_i ∈ S⁺ smaller than θ, and let x_− be the smallest x_i ∈ S⁺ greater than θ. (Recall that instances 0 and 1 have been added to S⁺, so x_+ and x_− are well-defined.)

Lemma 6  Assume event zero and let x_+ be the largest positive point in S⁺ and x_− be the smallest negative point in S⁺. After conditioning on the labeled sample L, if the test point is sampled from D_{θ,q} (independent of S⁺), then the error probability of the Bayes optimal predictor is

    Pr(mistake | x_+, x_−) = q/2 + p(x_− − x_+)/4 = c/(2m) + (m − c)(x_− − x_+)/(4m).

Proof: After conditioning on L and event zero, θ is uniformly distributed on (x_+, x_−). The label of a test point x is known whenever x ≤ x_+ or x ≥ x_−. Only when x_+ < x < x_− can the Bayes optimal predictor make a mistake. The Bayes optimal predictor predicts + on x if x < (x_+ + x_−)/2, predicts − on x if x > (x_+ + x_−)/2, and predicts arbitrarily when x = (x_+ + x_−)/2. Therefore, for a given value of θ, when the test point x is drawn from D_{θ,q}, the probability of a mistake is q/2 (for the fraction of time that x = θ) plus p |θ − (x_+ + x_−)/2| for the chance that x is drawn from U and falls between θ and the midpoint of (x_+, x_−). Hence

    Pr(mistake | x_+, x_−) = ∫_{x_+}^{x_−} ( q/2 + p |θ − (x_+ + x_−)/2| ) dP(θ | θ ∈ [x_+, x_−]) = q/2 + p(x_− − x_+)/4,

as desired.

The following lemma is due to Moran (1947).

Lemma 7 (Moran, 1947)  Assume a set {x_1, ..., x_m} of m ≥ 3 points is drawn i.i.d. from the uniform distribution on the unit interval. Relabel these points so that x_1 ≤ x_2 ≤ ··· ≤ x_m and set x_0 = 0 and x_{m+1} = 1. For the m + 1 gap lengths defined by g_i = x_{i+1} − x_i for 0 ≤ i ≤ m, we have E(Σ_{i=0}^{m} g_i²) = 2/(m + 2).

Note: if the set of points is drawn from a circle, then the first point can be taken as the endpoint, splitting the circle into an interval. This leads to the following, which we will find useful later.

Lemma 8  The expected sum of squared arc-lengths when the circle of circumference 1 is partitioned by m random points from the uniform distribution is 2/(m + 1).

Coming back to our lower bound proof, we are now ready to prove a lower bound for the zero case.

Lemma 9  For the Bayes optimal predictor,

    Pr(mistake | zero) = c/(2m) + 1/(2(m + 2)) − c/(2m(m + 2)).

Proof: Given the event zero, each x_i in S is drawn i.i.d. from the uniform distribution on (0, 1). We find it convenient to relabel the x_i ∈ S in sorted order so that x_1 ≤ x_2 ≤ ··· ≤ x_m. Note that x_0 = 0 and x_{m+1} = 1 in S⁺ are defined consistently with this sorted order. To simplify the notation, we leave the conditioning on event zero implicit in the remainder of the proof.

    Pr(mistake) = ∫∫ Pr(mistake | S, θ) dθ dP(S)    (3)
                = ∫ Σ_{i=0}^{m} Pr(θ ∈ (x_i, x_{i+1}) | S) Pr(mistake | θ ∈ (x_i, x_{i+1}), S) dP(S).    (4)

Since the threshold θ is drawn from U on (0, 1), Pr(θ ∈ (x_i, x_{i+1}) | S) = x_{i+1} − x_i. With an application of Lemma 6 we get:

    Σ_{i=0}^{m} Pr(θ ∈ (x_i, x_{i+1}) | S) Pr(mistake | θ ∈ (x_i, x_{i+1}), S)
        = Σ_{i=0}^{m} ( (x_{i+1} − x_i) q/2 + p (x_{i+1} − x_i)²/4 )
        = q/2 + (p/4) Σ_{i=0}^{m} (x_{i+1} − x_i)².

Substituting this into (4),

    Pr(mistake) = ∫ ( q/2 + (p/4) Σ_{i=0}^{m} (x_{i+1} − x_i)² ) dP(S)    (5)
                = q/2 + (p/4) E_{S∼U}[ Σ_{i=0}^{m} (x_{i+1} − x_i)² ]    (6)
                = q/2 + p/(2(m + 2)),    (7)

using Lemma 7 to evaluate the expectation in (6). Replacing q by c/m and p by 1 − c/m gives the desired result.
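As a numerical illustration (ours, not part of the proof; the parameter values are arbitrary), the identity of Lemma 7 and the expression of Lemma 9 can be checked directly by Monte Carlo:

```python
# Monte Carlo sanity check (ours): Lemma 7's E[sum of squared gaps] = 2/(m+2),
# and the zero-case error expression of Lemma 9.
import numpy as np

rng = np.random.default_rng(2)
m, c, trials = 50, 0.5, 100_000
q, p = c / m, 1 - c / m

x = np.sort(rng.random((trials, m)), axis=1)
pts = np.concatenate([np.zeros((trials, 1)), x, np.ones((trials, 1))], axis=1)
gaps = np.diff(pts, axis=1)                      # the m+1 gaps per trial
sq = (gaps ** 2).sum(axis=1)

print("E[sum g_i^2]:", sq.mean(), "vs", 2 / (m + 2))           # Lemma 7
est = q / 2 + (p / 4) * sq.mean()                               # q/2 + (p/4) E[sum g_i^2]
formula = c / (2 * m) + 1 / (2 * (m + 2)) - c / (2 * m * (m + 2))
print("Pr(mistake | zero):", est, "vs Lemma 9:", formula)
```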

3.2. Event one, exactly one x_i ∈ S is equal to θ

We lower bound the second term on the RHS of (2) by bounding Pr(one) and Pr(mistake | one).

Lemma 10  Pr(one) = mq(1 − q)^{m−1} ≥ ce^{−c}(1 + c/m − c²/m − c³/m²).

Proof:

    Pr(one) = mq(1 − q)^{m−1} = c(1 − c/m)^m / (1 − c/m) ≥ ce^{−c}(1 − c²/m)/(1 − c/m)
            ≥ ce^{−c}(1 − c²/m)(1 + c/m) ≥ ce^{−c}(1 + c/m − c²/m − c³/m²),

where Lemma 5 was used to bound (1 − c/m)^m and the last inequality used the fact that 1/(1 − z) ≥ 1 + z for all z ∈ [0, 1).

Lemma 11  Pr(mistake | one) ≥ 1/(2m) − O(1/m²).

Proof: Since θ is drawn from the uniform distribution, after conditioning on the event one, the x_i in S are i.i.d. from the uniform distribution on [0, 1]. Again consider the points relabeled in ascending order. Each of the x_i ∈ S is equally likely to be θ, and the x_i = θ is equally likely to be labeled + or −, giving 2m equally likely possibilities. As before, define x_+ and x_− to be the largest x_i ∈ S⁺ labeled + and the smallest x_i ∈ S⁺ labeled −, respectively.

We proceed assuming the Bayes optimal algorithm knows that one of the x_i = θ, i.e. that event one occurred. (This can only reduce its error probability.) To simplify the notation, we leave the conditioning on event one implicit in the remainder of the proof.

If there are both positive and negative examples, so that there is a greatest positive example x_+ and a least negative example x_−, there are two possibilities: either the target is (−∞, x_+] or it is (−∞, x_−). In either case, the entire open interval between x_+ and x_− shares the same label, and since the two cases are equally likely, that label is equally likely to be either + or −. Therefore:

    Pr(mistake) = ∫ Σ_{i=1}^{m−1} Pr(x_i = x_+ | S) Pr(mistake | S, x_+ = x_i) dPr(S)    (8)
               ≥ p ∫ Σ_{i=1}^{m−1} ((x_{i+1} − x_i)/2) Pr(x_i = x_+ | S) dPr(S).    (9)

We consider the sum in more detail. Note that x_i can be x_+ when either x_i = θ and is labeled +, or x_{i+1} = θ and is labeled −, so Pr(x_i = x_+ | S) = 1/m for 1 ≤ i ≤ m − 1, and

    Σ_{i=1}^{m−1} ((x_{i+1} − x_i)/2) Pr(x_i = x_+ | S) = (1/m) Σ_{i=1}^{m−1} (x_{i+1} − x_i)/2 = (x_m − x_1)/(2m).

Plugging this into (9) gives:

    Pr(mistake) ≥ (p/(2m)) E_{S∼U}[x_m − x_1] = 1/(2m) − O(1/m²),

since the expected length of each missing end-interval is 1/(m + 1).

3.3. Putting it together

Combining Lemma 5, Lemma 9, Lemma 10, and Lemma 11, we get that Pr(mistake | zero) Pr(zero) + Pr(mistake | one) Pr(one) is at least

    ( c/(2m) + 1/(2(m + 2)) − c/(2m(m + 2)) ) e^{−c}(1 − c²/m)
        + ( 1/(2m) − O(1/m²) ) ce^{−c}(1 + c/m − c²/m − c³/m²)
    = (1 + 2c) e^{−c} / (2m) − O(1/m²).

Setting c = 1/2, the maximizer of (1 + 2c)e^{−c}, the bound becomes 1/(m√e) − O(1/m²), completing the proof of Theorem 1.

4. Proof of Theorem 2

Here we show that knowledge of D can be exploited by a maximum-margin-in-probability algorithm to achieve prediction error probability at most (1/2 + o(1))/m. The marginal distribution D and target threshold θ are chosen adversarially, but the algorithm is given the distribution D as well as the training sample.

The first step is to show that we can assume without loss of generality that D is the uniform distribution U over (0, 1). This is a slight generalization of the rescaling trick of Ben-David et al. (2008).

Lemma 12  opt_D(m) ≤ opt_U(m).

Proof  Our proof is through a prediction-preserving reduction (Pitt and Warmuth, 1990). This consists of a (possibly randomized) instance transformation φ and a target transformation ψ. In this proof φ maps R into [0, 1], and ψ maps a threshold in R to a new threshold in [0, 1]. Implicit in the analysis of Pitt and Warmuth (1990) is the observation that, if φ(x) ≤ ψ(θ) ⟺ x ≤ θ for all training and test examples x, then an algorithm A_t with prediction error bound b_t for the transformed problem can be used to solve the original problem. By feeding A_t the training data φ(x_1), ..., φ(x_m) and the test point φ(x), and using A_t's prediction (of whether φ(x) ≤ ψ(θ) or not), one gets an algorithm with prediction error bound b_t for the original problem.

When D has a density, φ(x) = Pr_{z∼D}(z ≤ x) and ψ(θ) = Pr_{z∼D}(z ≤ θ). In this case the distribution of φ(x) is uniform over (0, 1), and since φ and ψ are identical and monotone, φ(x) ≤ ψ(θ) ⟺ x ≤ θ.

If D has one or more accumulation points, then, for each accumulation point x_a, we choose φ(x_a) uniformly from the interval (Pr_{z∼D}(z < x_a), Pr_{z∼D}(z ≤ x_a)]. We still set ψ(θ) = Pr_{z∼D}(z ≤ θ) everywhere. For this transformation, the probability distribution over φ(x) is still uniform over (0, 1), and, as before, if x ≠ θ or x is not an accumulation point, then φ(x) ≤ ψ(θ) ⟺ x ≤ θ. When x_a = θ for an accumulation point x_a, φ(x_a) ≤ ψ(θ), since ψ(θ) is set to be the right endpoint of (Pr_{z∼D}(z < x_a), Pr_{z∼D}(z ≤ x_a)]. Thus, overall, φ(x) ≤ ψ(θ) ⟺ x ≤ θ, and the probability of a mistake in the original problem is bounded by the mistake probability with respect to the transformed problem.
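For a D with a density, the reduction can be sketched in a few lines of code (ours, not from the paper; the particular choice of D as the law of U³, with CDF x^(1/3), is only an example): mapping every point through the CDF leaves the comparisons x ≤ θ unchanged and makes the marginal uniform on (0, 1).

```python
# Sketch (ours) of the rescaling reduction of Lemma 12 for a D with a density.
import numpy as np

rng = np.random.default_rng(3)
phi = lambda x: x ** (1 / 3)                 # CDF of D when X = U^3, U ~ Uniform(0,1)

theta = 0.2
x = rng.random(100_000) ** 3                 # sample from D
print("comparisons preserved:", np.array_equal(x <= theta, phi(x) <= phi(theta)))
print("transformed marginal looks uniform:",
      np.allclose(np.quantile(phi(x), [0.25, 0.5, 0.75]), [0.25, 0.5, 0.75], atol=0.01))
```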

So now we are faced with the subproblem of learning initial intervals in the case that D is the uniform distribution U over (0, 1). The algorithm that we analyze for this problem is the following maximum margin algorithm A_MM: if the training data includes both positive and negative examples, it predicts using a threshold halfway between the greatest positive example and the least negative example. If all of the examples are negative, it uses a threshold halfway between the least negative example and 0, and if all of the examples are positive, A_MM uses a threshold halfway between the greatest positive example and 1.

The basic idea exploits the fact that the uniform distribution is invariant to horizontal shifts to average over random shifts. We define x ⊕ s := x + s − ⌊x + s⌋ to be addition modulo 1, so that, intuitively, x ⊕ s is obtained by starting at x and walking s units to the right while wrapping around to 0 whenever 1 is reached. We extend the notation to sets in the natural way: if T ⊆ [0, 1) then T ⊕ s = {t ⊕ s : t ∈ T}.

Fix an arbitrary target threshold θ ∈ (0, 1) and also fix (for now) an arbitrary set S = {x_1, ..., x_m} of training points (whose labels are determined by θ). Renumber the points in S so that x_1 ≤ x_2 ≤ ··· ≤ x_m. To simplify some expressions, we use both x_1 and x_{m+1} to refer to x_1. For each x, s ∈ [0, 1], let error(s, x) be the {0, 1}-valued indicator for whether A_MM makes a mistake when trained with S ⊕ s and tested with x. Let ε(s) = E_{x∼U}(error(s, x)), and let error = E_{s∼U}(ε(s)).

For 1 ≤ i < m, let G_i = [x_i, x_{i+1}) be the points in the interval between x_i and x_{i+1}, and let G_m = [x_m, 1) ∪ [0, x_1), so the G_i partition [0, 1). Let R_i = {s ∈ [0, 1) : θ ∈ G_i ⊕ s} and notice that the R_i also partition [0, 1). We have

    error = ∫_0^1 ε(s) ds = Σ_i ∫_{s∈R_i} ε(s) ds.

We now consider ∫_{s∈R_i} ε(s) ds in more detail. This integral corresponds to the situation where the shifted sample causes θ to fall in the gap between x_i and x_{i+1}. Depending on the location of θ and the length of the gap, the shifted interval might extend past either 0 or 1 while containing θ, and thus wrap around to the other side of the unit interval. Let the gap length g_i be x_{i+1} − x_i (or 1 − x_m + x_1 if i = m), and let r_i be the shift taking x_{i+1} to θ, so x_{i+1} ⊕ r_i = θ. Note that a shift of r_i + g_i takes x_i to θ, even though r_i + g_i might be greater than one, which happens when θ ∈ [x_i, x_{i+1}).

We will now give a proof-by-plot that ∫_{s∈R_i} ε(s) ds ≤ g_i²/4. For an algebraic proof, see Appendix A. We begin by assuming θ ≤ 1/2 since the situation with θ > 1/2 is symmetrical. The cases we consider depend on the relationship between g_i and θ: case (A) g_i ≤ θ, case (B) θ < g_i ≤ 1 − θ, and case (C) 1 − θ < g_i. Cases (B) and (C) have two sub-cases depending on whether or not θ ≤ g_i/2. The figure for each case shows several shiftings of the interval, and plots ε(s) as a function of s. The predictions of A_MM on the shifted interval are indicated, and the region where A_MM makes a prediction error is shaded. The top shifting in each case is actually for s = r_i plus some small ɛ (so that θ < x_{i+1} ⊕ s), although it is labeled as s = r_i for simplicity. Note that in each case, the plot of ε(s) lies within two triangles with height g_i/2. Therefore the integrals ∫_{r_i}^{r_i+g_i} ε(s) ds are at most g_i²/4.
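This claim (summed over the gaps) can also be checked numerically. The following sketch (ours, not from the paper; the sample size, θ, and the grid step are arbitrary choices) implements A_MM with the artificial examples (0, +) and (1, −), shifts the sample modulo 1, and compares the integral of ε(s) over all shifts with Σ_i g_i²/4.

```python
# Numerical check (ours) that the integral of eps(s) over all shifts is at most
# the sum of g_i^2/4 over the circular gaps.
import numpy as np

def a_mm(x, theta_true):
    y = x <= theta_true
    lo = x[y].max(initial=0.0)               # greatest positive (or artificial 0)
    hi = x[~y].min(initial=1.0)              # least negative (or artificial 1)
    return (lo + hi) / 2

rng = np.random.default_rng(4)
m, theta = 8, 0.31
x = np.sort(rng.random(m))
gaps = np.diff(np.concatenate([x, [x[0] + 1]]))   # m circular gaps g_1..g_m

total, ds = 0.0, 1e-4
for s in np.arange(0.0, 1.0, ds):
    xs = (x + s) % 1.0                            # shifted sample S (+) s
    total += abs(theta - a_mm(xs, theta)) * ds    # eps(s): misclassified mass

print("integral of eps(s) over all shifts:", total)
print("sum of g_i^2 / 4:                  ", (gaps ** 2).sum() / 4)
```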

Case (A): g_i ≤ θ. The [x_i, x_{i+1}) ⊕ s interval intersects θ only when s + x_i ≤ θ < s + x_{i+1}.
[Figure: shifted copies of the gap at s = r_i, r_i + g_i/2, and r_i + g_i, together with the plot of ε(s) over s ∈ [r_i, r_i + g_i], which forms two triangles of height g_i/2.]

Case (B), subcase (1): g_i + θ ≤ 1 and θ < g_i/2. Note that the dashed part of the interval is actually shifted to the right edge of [0, 1).
[Figure: shifted copies of the gap and the plot of ε(s) over s ∈ [r_i, r_i + g_i]; the plot lies within two triangles of height g_i/2.]

Case (B), subcase (2): g_i + θ ≤ 1 and g_i/2 < θ < g_i. Again, the dashed part of the interval is actually shifted to the right edge of [0, 1).
[Figure: shifted copies of the gap and the plot of ε(s) over s ∈ [r_i, r_i + g_i]; the plot lies within two triangles of height g_i/2.]

Case (C), subcase (1): θ < g_i/2, and θ + g_i > 1. On the left: shifting the interval between x_i and x_{i+1} to intersect θ. In this case not only can part of the interval be shifted to the right edge of [0, 1), but part of the interval can also extend beyond 1 (and be shifted to the left edge of [0, 1)) while the shifted interval contains θ.
[Figure: shifted copies of the gap and the plot of ε(s) over s ∈ [r_i, r_i + g_i]; the plot has an extra bump at s = r_i + 1 − θ, where the interval begins to wrap past 1.]

The correctness of the plot relies on the bump in the plot at s = r_i + 1 − θ never rising outside of the triangle. We have redrawn the triangle in more detail to the right (dropping the subscripts) and relabeled the x-axis. This shows us the height of the dotted line of the triangle at the bump location, while the bump itself is at height (1 − g)/2.

Setting the two heights equal and solving for g yields g = 2θ and g = 1, exactly the boundaries of this case. Therefore the difference between the two heights (which is the amount by which the boundary of the triangle lies above the bump) does not change sign. When g = 5/6 and θ = 2/6 the difference is 1/36, so the difference remains non-negative for all g_i and θ in this case.

Case (C), subcase (2): θ > g_i/2, and θ + g_i > 1. On the left: shifting the interval between x_i and x_{i+1} to intersect θ. Again, parts of the interval can be shifted across 0 and across 1 while the shifted interval contains θ.
[Figure: shifted copies of the gap and the plot of ε(s) over s ∈ [r_i, r_i + g_i]; the plot lies within two triangles of height g_i/2.]

We now have, for any θ ∈ [0, 1],

    Pr_{S∼U^m, x∼U}(A_MM incorrect) = E_{S∼U^m} E_{s∼U, x∼U}( A_MM(S ⊕ s, x) ≠ 1_{[0,θ]}(x) )
        ≤ E_{S∼U^m}[ Σ_{i=1}^{m} g_i²/4 ] ≤ 1/(2(m + 1)),

where the last inequality uses Lemma 8. This completes the proof of Theorem 2.

5. Proof of Theorem 3

In this section we show that any maximum margin algorithm can be forced to have an error probability of (1 − o(1))/m when it is not given knowledge of D (i.e. without transforming the input as in the previous section). This is a factor of 2 worse than our upper bound for an algorithm that uses knowledge of D to maximize a probability-weighted margin.

Let c be a positive even integer (by choosing a large constant value for c, our lower bound will get arbitrarily close to 1/m). For a given training set size m, consider the set T = {3^{−cm}, ..., 3^{−2}, 3^{−1}} containing the first cm powers of 1/3. Fix distribution D to be the uniform distribution over T. Fix the target threshold to be θ = 3^{−cm/2−1}, so that half the points in T are labeled positively (recall that all points less than or equal to the threshold are labeled positively).

Maximum margin algorithms can make different predictions only when all the examples have the same label. For this choice of a distribution D and a target θ, the probability that all the examples have the same label is 2^{1−m}. Thus, for large enough m, the difference between maximum margin algorithms is negligible. From here on, let us consider the algorithm A_MM, defined earlier, that adds two artificial examples, (0, +) and (1, −), and predicts using the maximum margin classifier on the resulting input.

For 1 ≤ i ≤ cm/2, let T_i be the i points in T just above the threshold: if l = cm/2 + 1 = −log_3 θ then T_i = {3^{−l+1}, 3^{−l+2}, ..., 3^{−l+i}}. Let event Miss_i be the event that none of the

training points are in T_i. For i < cm/2, let event Exact_i be the event that both (a) none of the training points are in T_i, and (b) some training point is in T_{i+1} \ T_i (i.e. some training point is 3^{−l+i+1}). Let Exact_{cm/2} be the event that no training point is labeled negative. Therefore the Exact_i events are disjoint and Miss_i = ∪_{j≥i} Exact_j.

Note that if Exact_i occurs, then the smallest negative example is 3^{−l+i+1}. Furthermore, all points in T_i are less than half this value, and the maximum margin algorithm predicts incorrectly on exactly the i points in T_i, so Pr(error | Exact_i) = i/(cm). Thus, for m > 2c, we have

    Pr(error) = Σ_{i=1}^{cm/2} Pr(error | Exact_i) Pr(Exact_i) = (1/(cm)) Σ_{i=1}^{cm/2} i Pr(Exact_i)
             = (1/(cm)) Σ_{i=1}^{cm/2} Pr(Miss_i) ≥ (1/(cm)) Σ_{i=1}^{c²} (1 − i/(cm))^m.

For i ≤ c², in the limit as m → ∞, (1 − i/(cm))^m → exp(−i/c), so for m large enough (large relative to the constant c) we can continue as follows.

    Pr(error) ≥ ((1 − ε₁)/(cm)) Σ_{i=1}^{c²} exp(−i/c)
             = ((1 − ε₁)/(cm)) · exp(−1/c) (1 − exp(−c)) / (1 − exp(−1/c))
             = ((1 − ε₁)(1 − ε₂)/(cm)) · exp(−1/c) / (1 − exp(−1/c)),

where ε₂ = e^{−c}. Now, replacing 1/c by a we get:

    Pr(error) ≥ (1 − ε₁)(1 − ε₂) · ( a exp(−a) / (1 − exp(−a)) ) · (1/m).

Using L'Hôpital's rule, we see that the limit of the second fraction as a → 0 is 1. So for large enough c, the second fraction is at least 1 − ε₃, and

    Pr(error) ≥ (1 − ε₁)(1 − ε₂)(1 − ε₃)/m.

Thus, by making the constant c large enough, and choosing m large enough compared to c, the expected error of the maximum margin algorithm can be made arbitrarily close to 1/m.

6. Conclusion

Algorithms that know the underlying marginal distribution D over the instances can learn significantly more accurately than algorithms that do not. Since knowledge of D has been proposed as a proxy for a large number of unlabeled examples, our results indicate a benefit for semi-supervised learning. It is particularly intriguing that our analysis shows the benefit of semi-supervised learning when the distribution is nearly uniform, but slightly concentrated near the decision boundary. This is in sharp contrast to previous analyses showing the benefits of semi-supervised learning, which typically rely on a cluster assumption postulating that examples are sparse along the decision boundary.
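The gap discussed above can also be illustrated empirically on the Section 5 construction. The following simulation (ours, not from the paper; the values of m, c, and the choice of the median of the skipped support points as a stand-in for the D-weighted midpoint are our own) estimates m times the error of the maximum margin algorithm and of a predictor that splits the D-mass of the version space in half: the former approaches 1 as c grows, while the latter stays near 1/2.

```python
# Small simulation (ours) of the Section 5 construction: D uniform over the
# first c*m powers of 1/3, threshold at 3^{-(c*m/2+1)}.
import numpy as np

rng = np.random.default_rng(5)
m, c, trials = 100, 6, 20_000
points = 3.0 ** -np.arange(1, c * m + 1)       # support of D (decreasing)
theta = 3.0 ** -(c * m // 2 + 1)

mm = dw = 0
for _ in range(trials):
    x = rng.choice(points, size=m)
    y = x <= theta
    lo = x[y].max(initial=0.0)                 # greatest positive (or artificial 0)
    hi = x[~y].min(initial=1.0)                # least negative (or artificial 1)
    t_mm = (lo + hi) / 2                       # maximum margin threshold
    inside = points[(points > lo) & (points < hi)]
    t_dw = np.median(inside) if inside.size else t_mm   # split D-mass of (lo, hi)
    test = rng.choice(points)
    mm += (test <= theta) != (test <= t_mm)
    dw += (test <= theta) != (test <= t_dw)

print("max margin:        m * error =", m * mm / trials)  # well above 1/2, -> 1 as c grows
print("D-weighted middle: m * error =", m * dw / trials)  # stays near 1/2
```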

References

M. Balcan and A. Blum. A discriminative model for semi-supervised learning. JACM, 57(3), 2010.

S. Ben-David, T. Lu, and D. Pal. Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. COLT, 2008.

B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. Proceedings of the 1992 Workshop on Computational Learning Theory, pages 144–152, 1992.

Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. MIT Press, 2006.

Malte Darnstädt and Hans-Ulrich Simon. Smart PAC-learners. Theoretical Computer Science, 412(19), 2011.

A. Ehrenfeucht, D. Haussler, M. Kearns, and L. G. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82(3), 1989.

D. Haussler, N. Littlestone, and M. K. Warmuth. Predicting {0,1}-functions on randomly drawn points. Information and Computation, 115(2), 1994.

R. Herbrich, T. Graepel, and C. Campbell. Bayes point machines. Journal of Machine Learning Research, 1:245–279, 2001.

M. Kääriäinen. Generalization error bounds using unlabeled data. COLT, 2005.

Y. Li, P. M. Long, and A. Srinivasan. The one-inclusion graph algorithm is near-optimal for the prediction model of learning. IEEE Transactions on Information Theory, 47(3), 2001.

P. A. P. Moran. The random division of an interval. Supplement to the Journal of the Royal Statistical Society, 9(1):92–98, 1947.

L. Pitt and M. K. Warmuth. Prediction preserving reducibility. Journal of Computer and System Sciences, 41(3), 1990.

R. Urner, S. Shalev-Shwartz, and S. Ben-David. Access to unlabeled data can speed up prediction time. ICML, 2011.

V. N. Vapnik and A. Lerner. Pattern recognition using generalized portrait method. Automation and Remote Control, 24, 1963.

Xiaojin Zhu and Andrew B. Goldberg. Introduction to Semi-Supervised Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan-Claypool, 2009.

Appendix A. Algebraic completion of the proof of Theorem 2

Here we give an algebraic proof that ∫_{s∈R_i} ε(s) ds ≤ g_i²/4, which was sketched using plots in Section 4. This integral corresponds to the situation where the shifted sample causes θ to fall in the gap between x_i and x_{i+1}. We again assume that θ ≤ 1/2 (the other case is symmetrical) and proceed using the same cases. As before, let r_i be the shift taking x_{i+1} to θ and g_i be the length of the (x_i, x_{i+1}) gap. Thus x_{i+1} ⊕ r_i = θ, and a shift of r_i + g_i takes x_i to θ even though r_i + g_i might be greater than one.

Case (A): g_i ≤ θ. In this case, θ falls in the gap only for shifts s where x_i ⊕ s ≤ θ < x_{i+1} ⊕ s. The maximum margin algorithm makes a mistake on a randomly drawn test point exactly when the test point is between the middle of the (shifted) gap and θ. Therefore,

    ∫_{r_i}^{r_i+g_i} ε(s) ds = ∫_{r_i}^{r_i+g_i} | θ − ( (x_i + x_{i+1})/2 + s ) | ds = 2 ∫_0^{g_i/2} z dz = g_i²/4.

Case (B): θ ≤ g_i ≤ 1 − θ. In this case, as before, the integral goes from r_i to r_i + g_i. However, the expected error is slightly more complicated: while s ∈ [r_i, r_i + g_i − θ], all of the examples are negative and the expected error is |θ − (x_{i+1} ⊕ s)/2|, and when s ∈ [r_i + g_i − θ, r_i + g_i] the expected error is |θ − (x_{i+1} ⊕ s + x_i ⊕ s)/2|; see the Case B plots in Section 4. Thus:

    ∫_{r_i}^{r_i+g_i} ε(s) ds = ∫_{r_i}^{r_i+g_i−θ} | θ − (x_{i+1} ⊕ s)/2 | ds + ∫_{r_i+g_i−θ}^{r_i+g_i} | θ − (x_{i+1} ⊕ s + x_i ⊕ s)/2 | ds.

Using a change of variables (t = s − r_i), we get

    ∫_{r_i}^{r_i+g_i} ε(s) ds = ∫_0^{g_i−θ} | θ − (θ + t)/2 | dt + ∫_{g_i−θ}^{g_i} | θ − (θ + t + θ + t − g_i)/2 | dt
                             = ∫_0^{g_i−θ} |θ − t|/2 dt + ∫_{g_i−θ}^{g_i} | t − g_i/2 | dt.    (10)

Subcase (B1): 2θ ≤ g_i. Continuing from Equation (10),

    ∫_{r_i}^{r_i+g_i} ε(s) ds = ∫_0^{θ} (θ − t)/2 dt + ∫_{θ}^{g_i−θ} (t − θ)/2 dt + ∫_{g_i−θ}^{g_i} (t − g_i/2) dt
                             = θ²/4 + (g_i − 2θ)²/4 + θ(g_i − θ)/2
                             = (g_i² − 2g_iθ + 3θ²)/4 < g_i²/4,

since 3θ² < 2g_iθ when 2θ ≤ g_i.

Subcase (B2): θ < g_i < 2θ. Again continuing from Equation (10),

    ∫_{r_i}^{r_i+g_i} ε(s) ds = ∫_0^{g_i−θ} (θ − t)/2 dt + ∫_{g_i−θ}^{g_i/2} (g_i/2 − t) dt + ∫_{g_i/2}^{g_i} (t − g_i/2) dt
                             = θ(g_i − θ)/2 − (g_i − θ)²/4 + (θ − g_i/2)²/2 + g_i²/8
                             = θ(2g_i − θ)/4 ≤ g_i²/4,

since θ(2g_i − θ)/4, as a function of θ, is nondecreasing on the interval (g_i/2, g_i), and therefore at most g_i²/4 over that interval.

Case (C): g_i ≥ 1 − θ. When s ∈ [r_i, r_i + g_i − θ], all of the examples are negative (as in case (B)), and when s ∈ (r_i + 1 − θ, r_i + g_i) all the examples are positive. This partitions the shifts in (r_i, r_i + g_i) into three parts (see the plots in Section 4). Initially θ falls in the gap between 0 and the shifted x_{i+1}. Then x_i shifts in and θ is in the gap between the shifted x_i and x_{i+1}. Finally, x_{i+1} wraps around and θ is in the gap between the shifted x_i and 1. Thus ∫_{r_i}^{r_i+g_i} ε(s) ds equals

    ∫_{r_i}^{r_i+g_i−θ} | θ − (x_{i+1} ⊕ s)/2 | ds + ∫_{r_i+g_i−θ}^{r_i+1−θ} | θ − (x_{i+1} ⊕ s + x_i ⊕ s)/2 | ds + ∫_{r_i+1−θ}^{r_i+g_i} | θ − (x_i ⊕ s + 1)/2 | ds.

Using the substitution t = s − r_i and following case B, this becomes

    ∫_{r_i}^{r_i+g_i} ε(s) ds = ∫_0^{g_i−θ} |θ − t|/2 dt + ∫_{g_i−θ}^{1−θ} | t − g_i/2 | dt + ∫_{1−θ}^{g_i} | θ + g_i − 1 − t |/2 dt.    (11)

Subcase (C1): 2θ ≤ g_i, so 1 − g_i ≤ θ ≤ g_i/2. Continuing from (11), ∫_{r_i}^{r_i+g_i} ε(s) ds equals

    ∫_0^{θ} (θ − t)/2 dt + ∫_{θ}^{g_i−θ} (t − θ)/2 dt + ∫_{g_i−θ}^{1−θ} (t − g_i/2) dt + ∫_{1−θ}^{g_i} (t − θ − g_i + 1)/2 dt
        = θ²/4 + (g_i − 2θ)²/4 + (1 − 2θ − g_i + 2g_iθ)/2 + ((1 − θ)² − (2 − 2θ − g_i)²)/4
        = g_i/2 + θ²/2 + θ/2 − g_iθ − 1/4.

Note that the second derivative w.r.t. θ is positive, so the r.h.s. is maximized when θ = 1 − g_i or θ = g_i/2. In both cases, it is easily verified that the value is at most g_i²/4.

Subcase (C2): g_i ≤ 2θ. Continuing from (11) and following the logic of case B, ∫_{r_i}^{r_i+g_i} ε(s) ds equals

    ∫_0^{g_i−θ} (θ − t)/2 dt + ∫_{g_i−θ}^{g_i/2} (g_i/2 − t) dt + ∫_{g_i/2}^{1−θ} (t − g_i/2) dt + ∫_{1−θ}^{g_i} (t − θ − g_i + 1)/2 dt
        = (3θ − g_i)(g_i − θ)/4 + (2θ − g_i)²/8 + (2 − 2θ − g_i)²/8 + (g_i + θ − 1)(3 − 3θ − g_i)/4
        = g_i/2 + θ/2 − θ²/2 − g_i²/4 − 1/4.

This is increasing in θ, and thus maximized when θ = 1/2, where it becomes g_i/2 − g_i²/4 − 1/8. We want to show that this bound is at most g_i²/4, i.e. that f(g_i) = g_i/2 − g_i²/2 − 1/8 ≤ 0. Combining the facts that g_i ≥ 1/2 in Case C, f(1/2) = 0, and f′(g_i) ≤ 0 when g_i ≥ 1/2, gives the desired inequality.
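As a final numerical spot check (ours, not part of the paper; the chosen (θ, g) pairs are arbitrary representatives of the four subcases), the piecewise error expressions in (10) and (11) can be integrated on a fine grid and compared with the closed forms above and with the bound g²/4.

```python
# Spot check (ours) of the closed forms for subcases B1, B2, C1, C2.
import numpy as np

def integral(theta, g, n=400_000):
    t = (np.arange(n) + 0.5) * g / n          # midpoint rule on [0, g]
    if theta + g <= 1:                        # case (B): equation (10)
        eps = np.where(t < g - theta, np.abs(theta - t) / 2, np.abs(t - g / 2))
    else:                                     # case (C): equation (11)
        eps = np.where(t < g - theta, np.abs(theta - t) / 2,
              np.where(t < 1 - theta, np.abs(t - g / 2),
                       np.abs(theta + g - 1 - t) / 2))
    return eps.mean() * g

checks = {
    "B1": (0.20, 0.5, lambda th, g: (g * g - 2 * g * th + 3 * th * th) / 4),
    "B2": (0.30, 0.4, lambda th, g: th * (2 * g - th) / 4),
    "C1": (0.30, 0.8, lambda th, g: g / 2 + th * th / 2 + th / 2 - g * th - 0.25),
    "C2": (0.45, 0.7, lambda th, g: g / 2 + th / 2 - th * th / 2 - g * g / 4 - 0.25),
}
for name, (th, g, formula) in checks.items():
    print(name, round(integral(th, g), 6), round(formula(th, g), 6), "<=", g * g / 4)
```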


More information

SPECTRUM sensing is a core concept of cognitive radio

SPECTRUM sensing is a core concept of cognitive radio World Acadey of Science, Engineering and Technology International Journal of Electronics and Counication Engineering Vol:6, o:2, 202 Efficient Detection Using Sequential Probability Ratio Test in Mobile

More information

1 Identical Parallel Machines

1 Identical Parallel Machines FB3: Matheatik/Inforatik Dr. Syaantak Das Winter 2017/18 Optiizing under Uncertainty Lecture Notes 3: Scheduling to Miniize Makespan In any standard scheduling proble, we are given a set of jobs J = {j

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information

Inspection; structural health monitoring; reliability; Bayesian analysis; updating; decision analysis; value of information Cite as: Straub D. (2014). Value of inforation analysis with structural reliability ethods. Structural Safety, 49: 75-86. Value of Inforation Analysis with Structural Reliability Methods Daniel Straub

More information

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness

A Note on Scheduling Tall/Small Multiprocessor Tasks with Unit Processing Time to Minimize Maximum Tardiness A Note on Scheduling Tall/Sall Multiprocessor Tasks with Unit Processing Tie to Miniize Maxiu Tardiness Philippe Baptiste and Baruch Schieber IBM T.J. Watson Research Center P.O. Box 218, Yorktown Heights,

More information

VC Dimension and Sauer s Lemma

VC Dimension and Sauer s Lemma CMSC 35900 (Spring 2008) Learning Theory Lecture: VC Diension and Sauer s Lea Instructors: Sha Kakade and Abuj Tewari Radeacher Averages and Growth Function Theore Let F be a class of ±-valued functions

More information

New Slack-Monotonic Schedulability Analysis of Real-Time Tasks on Multiprocessors

New Slack-Monotonic Schedulability Analysis of Real-Time Tasks on Multiprocessors New Slack-Monotonic Schedulability Analysis of Real-Tie Tasks on Multiprocessors Risat Mahud Pathan and Jan Jonsson Chalers University of Technology SE-41 96, Göteborg, Sweden {risat, janjo}@chalers.se

More information

Prediction by random-walk perturbation

Prediction by random-walk perturbation Prediction by rando-walk perturbation Luc Devroye School of Coputer Science McGill University Gábor Lugosi ICREA and Departent of Econoics Universitat Popeu Fabra lucdevroye@gail.co gabor.lugosi@gail.co

More information

Multiple Instance Learning with Query Bags

Multiple Instance Learning with Query Bags Multiple Instance Learning with Query Bags Boris Babenko UC San Diego bbabenko@cs.ucsd.edu Piotr Dollár California Institute of Technology pdollar@caltech.edu Serge Belongie UC San Diego sjb@cs.ucsd.edu

More information

Ensemble Based on Data Envelopment Analysis

Ensemble Based on Data Envelopment Analysis Enseble Based on Data Envelopent Analysis So Young Sohn & Hong Choi Departent of Coputer Science & Industrial Systes Engineering, Yonsei University, Seoul, Korea Tel) 82-2-223-404, Fax) 82-2- 364-7807

More information

Support Vector Machines. Goals for the lecture

Support Vector Machines. Goals for the lecture Support Vector Machines Mark Craven and David Page Coputer Sciences 760 Spring 2018 www.biostat.wisc.edu/~craven/cs760/ Soe of the slides in these lectures have been adapted/borrowed fro aterials developed

More information

arxiv: v3 [cs.lg] 7 Jan 2016

arxiv: v3 [cs.lg] 7 Jan 2016 Efficient and Parsionious Agnostic Active Learning Tzu-Kuo Huang Alekh Agarwal Daniel J. Hsu tkhuang@icrosoft.co alekha@icrosoft.co djhsu@cs.colubia.edu John Langford Robert E. Schapire jcl@icrosoft.co

More information

A theory of learning from different domains

A theory of learning from different domains DOI 10.1007/s10994-009-5152-4 A theory of learning fro different doains Shai Ben-David John Blitzer Koby Craer Alex Kulesza Fernando Pereira Jennifer Wortan Vaughan Received: 28 February 2009 / Revised:

More information

1 Proving the Fundamental Theorem of Statistical Learning

1 Proving the Fundamental Theorem of Statistical Learning THEORETICAL MACHINE LEARNING COS 5 LECTURE #7 APRIL 5, 6 LECTURER: ELAD HAZAN NAME: FERMI MA ANDDANIEL SUO oving te Fundaental Teore of Statistical Learning In tis section, we prove te following: Teore.

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis

Soft Computing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Soft Coputing Techniques Help Assign Weights to Different Factors in Vulnerability Analysis Beverly Rivera 1,2, Irbis Gallegos 1, and Vladik Kreinovich 2 1 Regional Cyber and Energy Security Center RCES

More information

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks

Intelligent Systems: Reasoning and Recognition. Artificial Neural Networks Intelligent Systes: Reasoning and Recognition Jaes L. Crowley MOSIG M1 Winter Seester 2018 Lesson 7 1 March 2018 Outline Artificial Neural Networks Notation...2 Introduction...3 Key Equations... 3 Artificial

More information

In this chapter, we consider several graph-theoretic and probabilistic models

In this chapter, we consider several graph-theoretic and probabilistic models THREE ONE GRAPH-THEORETIC AND STATISTICAL MODELS 3.1 INTRODUCTION In this chapter, we consider several graph-theoretic and probabilistic odels for a social network, which we do under different assuptions

More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information

Machine Learning Basics: Estimators, Bias and Variance

Machine Learning Basics: Estimators, Bias and Variance Machine Learning Basics: Estiators, Bias and Variance Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Basics

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

A note on the multiplication of sparse matrices

A note on the multiplication of sparse matrices Cent. Eur. J. Cop. Sci. 41) 2014 1-11 DOI: 10.2478/s13537-014-0201-x Central European Journal of Coputer Science A note on the ultiplication of sparse atrices Research Article Keivan Borna 12, Sohrab Aboozarkhani

More information

A Note on the Applied Use of MDL Approximations

A Note on the Applied Use of MDL Approximations A Note on the Applied Use of MDL Approxiations Daniel J. Navarro Departent of Psychology Ohio State University Abstract An applied proble is discussed in which two nested psychological odels of retention

More information

A1. Find all ordered pairs (a, b) of positive integers for which 1 a + 1 b = 3

A1. Find all ordered pairs (a, b) of positive integers for which 1 a + 1 b = 3 A. Find all ordered pairs a, b) of positive integers for which a + b = 3 08. Answer. The six ordered pairs are 009, 08), 08, 009), 009 337, 674) = 35043, 674), 009 346, 673) = 3584, 673), 674, 009 337)

More information