Structural Risk Minimization over Data-Dependent Hierarchies

John Shawe-Taylor, Department of Computer Science, Royal Holloway and Bedford New College, University of London, Egham, TW20 0EX, UK
Peter L. Bartlett, Department of Systems Engineering, Australian National University, Canberra 0200, Australia
Robert C. Williamson, Department of Engineering, Australian National University, Canberra 0200, Australia
Martin Anthony, Department of Mathematics, London School of Economics, Houghton Street, London WC2A 2AE, UK

November 28, 1997

Abstract

The paper introduces some generalizations of Vapnik's method of structural risk minimisation (SRM). As well as making explicit some of the details on SRM, it provides a result that allows one to trade off errors on the training sample against improved generalization performance. It then considers the more general case when the hierarchy of classes is chosen in response to the data. A result is presented on the generalization performance of classifiers with a large margin. This theoretically explains the impressive generalization performance of the maximal margin hyperplane algorithm of Vapnik and co-workers (which is the basis for their support vector machines). The paper concludes with a more general result in terms of luckiness functions, which provides a quite general way for exploiting serendipitous simplicity in observed data to obtain better prediction accuracy from small training sets. Four examples are given of such functions, including the VC dimension measured on the sample.

Keywords: Learning Machines, Maximal Margin, Support Vector Machines, Probably Smooth Luckiness, Uniform Convergence, Vapnik-Chervonenkis Dimension, Fat Shattering Dimension, Computational Learning Theory, Probably Approximately Correct Learning.

1 Introduction

The standard Probably Approximately Correct (PAC) model of learning considers a fixed hypothesis class H together with a required accuracy ε and confidence 1 − δ. The theory characterises when a target function from H can be learned from examples in terms of the Vapnik-Chervonenkis dimension, a measure of the flexibility of the class H, and specifies sample sizes required to deliver the required accuracy with the allowed confidence. In many cases of practical interest the precise class containing the target function to be learned may not be known in advance. The learner may only be given a hierarchy of classes H_1 ⊆ H_2 ⊆ ··· ⊆ H_d ⊆ ··· and be told that the target will lie in one of the sets H_d. Structural Risk Minimization (SRM) copes with this problem by minimizing an upper bound on the expected risk, over each of the hypothesis classes. The principle is a curious one in that in order to have an algorithm it is necessary to have a good theoretical bound on the generalization performance. A formal statement of the method is given in the next section.

Linial, Mansour and Rivest [29] studied learning in a framework as above by allowing the learner to seek a consistent hypothesis in each subclass H_d in turn, drawing enough extra examples at each stage to ensure the correct level of accuracy and confidence should a consistent hypothesis be found. This paper (some of whose results appeared in [43]) addresses two shortcomings of Linial et al.'s approach. The first is the requirement to draw extra examples when seeking in a richer class. It may be unrealistic to assume that examples can be obtained cheaply, and at the same time it would be foolish not to use as many examples as are available from the start. Hence, we suppose that a fixed number of examples is allowed and that the aim of the learner is to bound the expected generalization error with high confidence. The second drawback of the Linial et al. approach is that it is not clear how it can be adapted to handle the case where errors are allowed on the training set. In this situation there is a need to trade off the number of errors with the complexity of the class, since taking a class which is too complex can result in a worse generalization error (with a fixed number of examples) than allowing some extra errors in a more restricted class. The model we consider allows a precise bound on the error arising in different classes and hence a reliable way of applying the structural risk minimisation principle introduced by Vapnik [48, 50]. Indeed, the results reported in Sections 2 and 3 of this paper are implicit in the cited references, but our treatment serves to introduce the main results of the paper in later sections, and we make explicit some of the assumptions implicit in the presentations in [48, 50]. A more recent paper by Lugosi and Zeger [38] considers standard SRM and provides bounds for the true error of the hypothesis with lowest empirical error in each class. Whereas our Theorem 2.3 gives an error bound that decreases to twice the empirical error roughly linearly with the ratio of the VC dimension to the number of examples, they give an error bound that decreases to the empirical error itself, but as the square root of this ratio.

From Section 3 onwards we address a shortcoming of the SRM method which Vapnik [48, page 161] highlights: "according to the SRM principle the structure has to be defined a priori before the training data appear." An algorithm using maximally separating hyperplanes proposed by Vapnik [46] and co-workers [14, 16] violates this principle in that the hierarchy defined depends on the data. In Section 3 we prove a result which shows that if one achieves correct classification of some training data with a class of {0,1}-valued functions obtained by thresholding real-valued functions, and if the values of the real-valued functions on the training points are all well away from zero, then there is a bound on the generalization error which can be much better than the one obtained from the VC dimension of the thresholded class. In Section 4 we apply this to the case considered by Vapnik: separating hyperplanes with a large margin. In Section 5 we introduce a more general framework which allows a rather large class of methods of measuring the luckiness of a sample, in the sense that the large margin is lucky. In Section 6 we explicitly show how Vapnik's maximum margin hyperplanes fit into this general framework, which then also allows the radius of the set of points to be estimated from the data. In addition, we show that the function which measures the VC dimension of the set of hypotheses on the sample points is a valid (un)luckiness function. This leads to a bound on the generalization performance in terms of this measured dimension rather than the worst case bound which involves the VC dimension of the set of hypotheses over the whole input space.

Our approach can be interpreted as a general way of encoding our bias, or prior assumptions, and possibly taking advantage of them if they happen to be correct. In the case of the fixed hierarchy, we expect the target (or a close approximation to it) to be in a class H_d with small d. In the maximal separation case, we expect the target to be consistent with some classifying hyperplane that has a large separation from the examples. This corresponds to a collusion between the probability distribution and the target concept, which would be impossible to exploit in the standard PAC distribution independent framework. If these assumptions happen to be correct for the training data, we can be confident we have an accurate hypothesis from a small data set (at the expense of some small penalty if they are incorrect).

A commonly studied related problem is that of model order selection (see for example [34]), and we here briefly make some remarks on the relationship with the work presented in this paper. Assuming the above hierarchy of hypothesis classes, the aim there is to identify the best class index. Often "best" in this literature simply means "correct" in the sense that if in fact the target hypothesis h ∈ H_i, then as the sample size grows to infinity, the selection procedure will (in some probabilistic sense) pick i. Other methods of complexity regularization can be seen to also solve similar problems (see for example [20, 6, 7, 8]). We are not aware of any methods (apart from SRM) for which explicit finite sample size bounds on their performance are available. Furthermore, with the exception of the methods discussed in [8], all such methods take the form of minimizing a cost function comprising an empirical risk plus an additive complexity term which does not depend on the data.
We denote logarithms to base 2 by log, and natural logarithms by ln. If S is a set, |S| denotes its cardinality. We do not explicitly state the measurability conditions needed for our arguments to hold. We assume with no further discussion permissibility of the function classes involved (see Appendix C of [41] and Section 2.3 of [45]).

2 Standard Structural Risk Minimisation

As an initial example we consider a hierarchy of classes H_1 ⊆ H_2 ⊆ ··· ⊆ H_d ⊆ ···, where H_i ⊆ {0,1}^X for some input space X, and where we will assume VCdim(H_d) = d for the rest of this section. (Recall that the VC dimension of a class of {0,1}-valued functions is the size of the largest subset of their domain for which the restriction of the class to that subset is the set of all {0,1}-valued functions; see [49].) Such a hierarchy of classes is called a decomposable concept class by Linial et al. [29]. Related work is presented by Benedek and Itai [12].

We will assume that a fixed number m of labelled examples are given as a vector z = (x, t(x)) to the learner, where x = (x_1, ..., x_m) and t(x) = (t(x_1), ..., t(x_m)), and that the target function t lies in one of the subclasses H_d. The learner uses an algorithm to find a value of d such that H_d contains an hypothesis h that is consistent with the sample z. What we require is a function ε(m, d, δ) which will give the learner an upper bound on the generalization error of h with confidence 1 − δ. The following theorem gives a suitable function. We use Er_z(h) = |{i : h(x_i) ≠ t(x_i)}| to denote the number of errors that h makes on z, and er_P(h) = P{x : h(x) ≠ t(x)} to denote the expected error when x_1, ..., x_m are drawn independently according to P. In what follows we will often write Er_x(h) (rather than Er_z(h)) when the target t is obvious from the context. The following theorem, which appears in [43], covers the case where there are no errors on the training set. It is a well-known result which we quote for completeness.

Theorem 2.1 [43] Let H_i, i = 1, 2, ..., be a sequence of hypothesis classes mapping X to {0,1} such that VCdim(H_i) = i, and let P be a probability distribution on X. Let p_d be any set of positive numbers satisfying Σ_{d=1}^∞ p_d = 1. With probability 1 − δ over m independent examples drawn according to P, for any d for which a learner finds a consistent hypothesis h in H_d, the generalization error of h is bounded from above by

ε(m, d, δ) = (4/m)(d ln(2em/d) + ln(1/p_d) + ln(4/δ)),

provided m ≥ d.

The role of the numbers p_d may seem a little counter-intuitive, as we appear to be able to bias our estimate by adjusting these parameters. The numbers must, however, be specified in advance and represent some apportionment of our confidence to the different points where failure might occur. In this sense they should be one of the arguments of the function ε(m, d, δ). We have deliberately omitted this dependence as they have a different status in the learning framework. It is helpful to think of p_d as our prior estimate of the probability that the smallest class containing a consistent hypothesis is H_d. In particular we can set p_d = 0 for d > m, since we would expect to be able to find a consistent hypothesis in H_m, and if we fail, the bound will not be useful for such large d in any case.

We also wish to consider the possibility of errors on the training sample. The result presented here is analogous to those obtained by Lugosi and Zeger [37] in the statistical framework. We will make use of the following result of Vapnik in a slightly improved version due to Anthony and Shawe-Taylor [4]. Note also that the result is expressed in terms of the quantity Er_z(h), which denotes the number of errors of the hypothesis h on the sample z, rather than the usual proportion of errors.

Theorem 2.2 ([4]) Let 0 < α < 1 and 0 < ε ≤ 1. Suppose H is an hypothesis space of functions from an input space X to {0,1}, and let ν be any probability measure on S = X × {0,1}. Then the probability (with respect to ν^m) that for z ∈ S^m there is some h ∈ H such that er_ν(h) > ε and Er_z(h) ≤ m(1 − α) er_ν(h) is at most

4 B_H(2m) exp(−α²εm/4).

Our aim will be to use a double stratification of δ: as before by class (via p_d), and also by the number of errors on the sample (via q_{dk}). The generalization error will be given as a function of the size of the sample m, the index d of the class, the number of errors on the sample k, and the confidence δ.

Theorem 2.3 Let H_i, i = 1, 2, ..., be a sequence of hypothesis classes mapping X to {0,1} and having VC dimension i. Let ν be any probability measure on S = X × {0,1}, and let p_d, q_{dk} be any sets of positive numbers satisfying Σ_{d=1}^∞ p_d = 1 and Σ_{k=0}^m q_{dk} = 1 for all d. Then with probability 1 − δ over m independent identically distributed examples x, if the learner finds an hypothesis h in H_d with Er_x(h) = k, then the generalization error of h is bounded from above by

ε(m, k, d, δ) = (1/m)(2k + 4d ln(2em/d) + 4 ln(4/(δ p_d q_{dk}))),

provided m ≥ d.

Proof: We bound the required probability of failure

ν^m{z : ∃d ∃k ∃h ∈ H_d, Er_z(h) = k, er_ν(h) > ε(m, k, d, δ)} < δ

by showing that for all d and k,

ν^m{z : ∃h ∈ H_d, Er_z(h) = k, er_ν(h) > ε(m, k, d, δ)} < δ p_d q_{dk}.

We will apply Theorem 2.2 once for each value of k and d. We must therefore ensure that only one value of α = α_{dk} is used in each case. An appropriate value is

α_{dk} = 1 − k/(m ε(m, k, d, δ)).

This ensures that if er_ν(h) > ε(m, k, d, δ) and Er_z(h) = k, then

Er_z(h) = k = m(1 − α_{dk}) ε(m, k, d, δ) ≤ m(1 − α_{dk}) er_ν(h),

as required for an application of the theorem. Hence, if m ≥ d, Sauer's lemma implies

ν^m{z : ∃h ∈ H_d, Er_z(h) = k, er_ν(h) > ε(m, k, d, δ)} ≤ 4 (2em/d)^d exp(−α_{dk}² ε(m, k, d, δ) m/4),

and the right hand side is less than δ p_d q_{dk} provided

α_{dk}² ε(m, k, d, δ) m/4 ≥ d ln(2em/d) + ln(4/(δ p_d q_{dk})),

which, ignoring one term of k²/(4m ε(m, k, d, δ)), holds for

ε(m, k, d, δ) = (1/m)(2k + 4d ln(2em/d) + 4 ln(4/(δ p_d q_{dk}))).

The result follows.

The choice of the prior q_{dk} for different k will again affect the resulting trade-off between complexity and accuracy. In view of our expectation that the penalty term for choosing a large class is probably an overestimate, it seems reasonable to give a correspondingly large penalty for a large number of errors. One possibility is an exponentially decreasing prior distribution such as q_{dk} = 2^{−(k+1)}, though the rate of decrease could also be varied between classes. Assuming the above choice, observe that an incremental search for the optimal value of d would stop when the reduction in the number of classification errors in the next class was less than

0.84 ln(2em/d).

Note that the trade-off between errors on the sample and generalization error is also discussed in [16].
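The bounds of Theorems 2.1 and 2.3, together with the stopping rule just described, are easy to evaluate numerically. The following Python sketch is purely illustrative: the priors p_d = 2^(-d) and q_dk = 2^(-(k+1)) are example choices, not prescribed by the theorems.

```python
from math import e, log

def srm_bound_consistent(m, d, delta, p_d):
    """Theorem 2.1: bound on the generalization error of a consistent hypothesis in H_d."""
    assert m >= d
    return (4.0 / m) * (d * log(2 * e * m / d) + log(1.0 / p_d) + log(4.0 / delta))

def srm_bound_with_errors(m, k, d, delta, p_d, q_dk):
    """Theorem 2.3: bound when the chosen hypothesis in H_d makes k errors on the sample."""
    assert m >= d
    return (1.0 / m) * (2 * k + 4 * d * log(2 * e * m / d)
                        + 4 * log(4.0 / (delta * p_d * q_dk)))

if __name__ == "__main__":
    m, delta = 10000, 0.01
    for d, k in [(10, 0), (10, 5), (50, 0), (50, 5)]:
        p_d, q_dk = 2.0 ** (-d), 2.0 ** (-(k + 1))   # illustrative priors
        print(d, k, round(srm_bound_with_errors(m, k, d, delta, p_d, q_dk), 4))
    # With q_dk = 2^{-(k+1)}, an incremental search over d stops once moving to the
    # next class saves fewer than roughly 0.84 ln(2em/d) training errors.
    d = 10
    print("stopping threshold:", round(0.84 * log(2 * e * m / d), 2))
```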

3 Classifiers with a Large Margin

The standard methods of structural risk minimization require that the decomposition of the hypothesis class be chosen in advance of seeing the data. In this section we introduce our first variant of SRM which effectively makes a decomposition after the data has been seen. The main tool we use is the fat-shattering dimension, which was introduced in [26], and has been used for several problems in learning since [1, 11, 2, 10]. We show that if a classifier correctly classifies a training set with a large margin, and if its fat-shattering function at a scale related to this margin is small, then the generalization error will be small. (This is formally stated in Theorem 3.9 below.)

Definition 3.1 Let F be a set of real valued functions. We say that a set of points X is γ-shattered by F if there are real numbers r_x indexed by x ∈ X such that for all binary vectors b indexed by X, there is a function f_b ∈ F satisfying

f_b(x) ≥ r_x + γ if b_x = 1, and f_b(x) ≤ r_x − γ otherwise.

The fat shattering dimension fat_F of the set F is a function from the positive real numbers to the integers which maps a value γ to the size of the largest γ-shattered set, if this is finite, or infinity otherwise.

Let T_θ denote the threshold function at θ, T_θ : R → {0,1}, T_θ(α) = 1 iff α > θ. Fix a class of [0,1]-valued functions. We can interpret each function f in the class as a classification function by considering the thresholded version, T_{1/2}(f). The following result implies that, if a real-valued function in the class maps all training examples to the correct side of 1/2 by a large margin, the misclassification probability of the thresholded version of the function depends on the fat-shattering dimension of the class, at a scale related to the margin. This result is a special case of Corollary 6 in [2], which applied more generally to arbitrary real-valued target functions. (This application to classification problems was not described in [2].)

Theorem 3.2 Let H be a set of [0,1]-valued functions defined on a set X. Let 0 < γ < 1/2. There is a positive constant K such that, for any function t : X → {0,1} and any probability distribution P on X, with probability at least 1 − δ over a sequence x_1, ..., x_m of examples chosen independently according to P, every h in H that has |h(x_i) − t(x_i)| < 1/2 − γ for i = 1, ..., m satisfies

Pr(|h(x) − t(x)| ≥ 1/2) < ε,

provided that

m ≥ (K/ε)(d log²(1/(γε)) + log(1/δ)),

where d = fat_H(γ/8).

Clearly, this implies that the misclassification probability is less than ε under the conditions of the theorem, since T_{1/2}(h(x)) ≠ t(x) implies |h(x) − t(x)| ≥ 1/2. In the remainder of this section, we present an improvement of this result. By taking advantage of the fact that the target values fall in the finite set {0,1}, and the fact that only the behaviour near the threshold of functions in H is important, we can remove the 1/(γε) factor from the log² factor in the bound. We also improve the constants that would be obtained from the argument used in the proof of Theorem 3.2.

Before we can quote the next lemma, we need another definition.

Definition 3.3 Let (X, d) be a (pseudo-)metric space, let A be a subset of X and ε > 0. A set B ⊆ X is an ε-cover for A if, for every a ∈ A, there exists b ∈ B such that d(a, b) < ε. The ε-covering number of A, N_d(ε, A), is the minimal cardinality of an ε-cover for A (if there is no such finite cover then it is defined to be ∞).
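Returning for a moment to Definition 3.1, the γ-shattering condition can be checked by brute force for a small finite class: pick one function per dichotomy and verify that, at every point, the values used when the bit is 1 and when the bit is 0 leave a gap of at least 2γ (a witness r_x can then sit in the middle of that gap). The affine class below is an illustrative assumption, not an example from the paper.

```python
from itertools import product

def gamma_shattered(points, functions, gamma):
    """Brute-force check of Definition 3.1 for a finite class of real-valued functions.

    For each assignment of one function per dichotomy, verify that at every point the
    smallest value used when the bit is 1 exceeds the largest value used when the bit
    is 0 by at least 2 * gamma; a witness r_x can then be placed inside that gap."""
    n = len(points)
    dichotomies = list(product((0, 1), repeat=n))
    for choice in product(range(len(functions)), repeat=len(dichotomies)):
        ok = True
        for i, x in enumerate(points):
            ones = [functions[choice[j]](x) for j, b in enumerate(dichotomies) if b[i] == 1]
            zeros = [functions[choice[j]](x) for j, b in enumerate(dichotomies) if b[i] == 0]
            if min(ones) - max(zeros) < 2 * gamma:
                ok = False
                break
        if ok:
            return True
    return False

# Illustrative assumption: a small grid of affine functions on the real line.
funcs = [lambda x, a=a, c=c: a * x + c for a in (-1.0, 0.0, 1.0) for c in (-1.0, 0.0, 1.0)]
print(gamma_shattered([0.0, 1.0], funcs, 0.4))   # True: the two points are 0.4-shattered
```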

The idea is that B should be finite but approximate all of A with respect to the pseudometric. As in [2], we will use the l∞ distance over a finite sample x = (x_1, ..., x_m) for the pseudometric in the space of functions,

d_x(f, g) = max_i |f(x_i) − g(x_i)|.

We write N(ε, F, x) for the ε-covering number of F with respect to the pseudo-metric d_x. We now quote a lemma from Alon et al. [1] which we will use below.

Lemma 3.4 (Alon et al. [1]) Let F be a class of functions X → [0,1] and P a distribution over X. Choose 0 < ε < 1 and let d = fat_F(ε/4). Then

E(N(ε, F, x)) ≤ 2 (4m/ε²)^{d log(2em/(dε))},

where the expectation E is taken w.r.t. a sample x ∈ X^m drawn according to P^m.

Corollary 3.5 Let F be a class of functions X → [a, b] and P a distribution over X. Choose 0 < ε < 1 and let d = fat_F(ε/4). Then

E(N(ε, F, x)) ≤ 2 (4m(b − a)²/ε²)^{d log(2em(b − a)/(dε))},

where the expectation E is over samples x ∈ X^m drawn according to P^m.

Proof: We first scale all the functions in F by the affine transformation mapping the interval [a, b] to [0, 1] to create the set of functions F'. Clearly, fat_{F'}(γ) = fat_F(γ(b − a)), while E(N(ε, F, x)) = E(N(ε/(b − a), F', x)). The result follows.

In order to motivate the next lemma we first introduce some notation we will use when we come to apply it. The aim is to transform the problem of observing a large margin into one of observing the maximal value taken by a set of functions. We do this by folding over the functions at the threshold. The following hat operator implements the folding. We define the mapping ^ : R^X → R^{X×{0,1}} by

^ : f ↦ f̂, f̂(x, c) = f(x)(1 − c) + (2θ − f(x))c,

for some fixed real θ. For a set of functions F, we define F̂ = F̂_θ = {f̂ : f ∈ F}. The idea behind this mapping is that for a function f, the corresponding f̂ maps the input x and its classification c to an output value, which will be less than θ provided the classification obtained by thresholding f(x) at θ is correct.
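The pseudometric d_x, the covering numbers it defines, and the folding operator just introduced are all simple to realise concretely. The sketch below builds a (not necessarily minimal) ε-cover greedily, so its size only upper-bounds N(ε, F, x); the sigmoid family is an illustrative assumption.

```python
import math

def d_x(f, g, sample):
    """The pseudometric of the text: l_infinity distance between f and g over the sample."""
    return max(abs(f(xi) - g(xi)) for xi in sample)

def greedy_cover(functions, sample, eps):
    """Greedily construct an eps-cover of `functions` w.r.t. d_x (Definition 3.3).

    Every function ends up within eps of some element of the returned cover, so its
    size is an upper bound on N(eps, F, x), though not necessarily the minimum."""
    cover = []
    for f in functions:
        if all(d_x(f, g, sample) >= eps for g in cover):
            cover.append(f)
    return cover

def fold(f, theta):
    """The hat operator: f_hat(x, c) = f(x)(1 - c) + (2*theta - f(x))*c, which stays
    below theta when thresholding f at theta classifies the labelled point (x, c) correctly."""
    return lambda x, c: f(x) * (1 - c) + (2 * theta - f(x)) * c

sample = [-1.0, -0.5, 0.0, 0.5, 1.0]
funcs = [lambda x, w=w: 1.0 / (1.0 + math.exp(-w * x)) for w in [i / 4.0 for i in range(-8, 9)]]
print("greedy cover size at eps = 0.1:", len(greedy_cover(funcs, sample, 0.1)))
```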

Lemma 3.6 Suppose F is a set of functions that map from X to R with finite fat-shattering dimension bounded by the function afat : R → N which is continuous from the right. Then for any distribution P on X, any k ∈ N and any θ ∈ R,

P^{2m}{xy : ∃f ∈ F, r = max_j{f(x_j)}, 2γ = θ − r, afat(γ/4) = k, |{i : f(y_i) ≥ r + 2γ}| > m ε(m, k, δ)} < δ,

where ε(m, k, δ) = (1/m)(k log(8em/k) log(32m) + log(2/δ)).

Proof: Using the standard permutation argument (as in [49]), we may fix a sequence xy and bound the probability under the uniform distribution on swapping permutations that the permuted sequence satisfies the condition stated. Let γ_k = min{γ' : afat(γ'/4) ≤ k}. Notice that the minimum is defined since afat is continuous from the right, and also that afat(γ_k/4) = afat(γ/4). For any γ satisfying afat(γ/4) = k, we have γ ≥ γ_k, so the probability above is no greater than

P^{2m}{xy : ∃γ ∈ R⁺, afat(γ/4) = k, ∃f ∈ F, A_f(2γ_k)},

where A_f(η) is the event that f(y_i) ≥ max_j{f(x_j)} + η for at least m ε(m, k, δ) points y_i in y. Note that r + 2γ = θ. Let

π(α) := θ if α > θ; θ − 2γ_k if α < θ − 2γ_k; α otherwise,

and let π(F) = {π ∘ f : f ∈ F}. Consider a minimal γ_k-cover B_xy of π(F) in the pseudometric d_xy. We have that for any f ∈ F, there exists f̃ ∈ B_xy with |π(f)(x) − f̃(x)| < γ_k for all x ∈ xy. Thus, since for all x ∈ x, by the definition of r, f(x) ≤ r = θ − 2γ, we have π(f)(x) ≤ θ − 2γ_k = r + 2(γ − γ_k), and so f̃(x) < r + 2γ − γ_k. However, there are at least m ε(m, k, δ) points y ∈ y such that f(y) ≥ r + 2γ, so f̃(y) > r + 2γ − γ_k > max_j{f̃(x_j)}. Since π only reduces separation between output values, we conclude that the event A_{f̃}(0) occurs. By the permutation argument, for fixed f̃, at most 2^{−ε(m,k,δ)m} of the sequences obtained by swapping corresponding points satisfy the conditions, since the ε(m, k, δ)m points with the largest f̃ values must remain on the right hand side for A_{f̃}(0) to occur. Thus by the union bound,

P^{2m}{xy : ∃γ ∈ R⁺, afat(γ/4) = k, ∃f ∈ F, A_f(2γ_k)} ≤ E(|B_xy|) 2^{−ε(m,k,δ)m},

where the expectation is over xy drawn according to P^{2m}. Now for all η > 0, fat_{π(F)}(η) ≤ fat_F(η), since every set of points η-shattered by π(F) can be η-shattered by F. Furthermore, π(F) is a class of functions mapping the set X to the interval [θ − 2γ_k, θ]. Hence, by Corollary 3.5 (setting [a, b] to [θ − 2γ_k, θ], ε to γ_k, and m to 2m),

E(|B_xy|) = E(N(γ_k, π(F), xy)) ≤ 2 (8m(2γ_k)²/γ_k²)^{d log(4em(2γ_k)/(d γ_k))} = 2 (32m)^{d log(8em/d)},

where d = fat_{π(F)}(γ_k/4) ≤ fat_F(γ_k/4) ≤ k. Thus

E(|B_xy|) ≤ 2 (32m)^{k log(8em/k)},

and so E(|B_xy|) 2^{−ε(m,k,δ)m} < δ provided

ε(m, k, δ) ≥ (1/m)(k log(8em/k) log(32m) + log(2/δ)),

as required.
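The counting step at the heart of this proof — once f̃ is fixed, the ε(m, k, δ)m points with the largest f̃ values must all stay in the second half of the double sample, which happens for at most a 2^(−ε(m,k,δ)m) fraction of the coordinate swaps — can be checked empirically. The following Monte Carlo sketch is a simplified illustration under an assumed configuration of values, not part of the proof.

```python
import random

def swap_survival_rate(values_x, values_y, n_big, trials=20000, seed=0):
    """Estimate the fraction of swapping permutations for which the n_big largest values
    of the double sample all land on the y side (the event needed for A_f~(0))."""
    rng = random.Random(seed)
    threshold = sorted(values_x + values_y, reverse=True)[n_big - 1]
    hits = 0
    for _ in range(trials):
        # a swapping permutation independently exchanges each pair (x_i, y_i)
        y_side = [(b if rng.random() < 0.5 else a) for a, b in zip(values_x, values_y)]
        if sum(v >= threshold for v in y_side) >= n_big:
            hits += 1
    return hits / trials

# Assumed configuration: the n_big largest values sit in distinct pairs on the y side,
# so each must independently survive its swap and the rate should be close to 2**(-n_big).
m, n_big = 50, 6
values_x = [0.0] * m
values_y = [10.0] * n_big + [0.0] * (m - n_big)
print(swap_survival_rate(values_x, values_y, n_big), "vs", 2.0 ** (-n_big))
```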

The function afat(γ) is used in this theorem rather than fat_F(γ) since we use the continuity property to ensure that afat(γ_k/4) = k for every k, while we cannot assume that fat_F(γ) is continuous from the right. We could avoid this requirement and give an error estimate directly in terms of fat_F instead of afat, but this would introduce a worse constant in the argument of fat_F. Since in practice one works with continuous upper bounds on fat_F(γ) (e.g. c/γ²) by taking the floor of the value, the critical question becomes whether the bound is strict rather than less than or equal. Provided fat_F is strictly less than the continuous bound, the corresponding floor function is continuous from the right. If not, addition of an arbitrarily small constant to the continuous function will allow substitution of a strict inequality.

Lemma 3.7 Let F be a set of real valued functions from X to R. Then for all γ ≥ 0,

fat_{F̂}(γ) = fat_F(γ).

Proof: For any c ∈ {0,1}^m, we have that f_b realises dichotomy b on x = (x_1, ..., x_m) with margin γ about output values r_i if and only if f̂_b realises dichotomy b ⊕ c on x̂ = ((x_1, c_1), ..., (x_m, c_m)) with margin γ about output values

r̂_i = r_i(1 − c_i) + (2θ − r_i)c_i.

We will make use of the following lemma, which in the form below is due to Vapnik [46, page 168].

Lemma 3.8 Let X be a set and S a system of sets on X, and P a probability measure on X. For x ∈ X^m and A ∈ S, define ν_x(A) := |x ∩ A|/m. If m > 2/ε, then

P^m{x : sup_{A∈S} |ν_x(A) − P(A)| > ε} ≤ 2 P^{2m}{xy : sup_{A∈S} |ν_x(A) − ν_y(A)| > ε/2}.

Let T_θ denote the threshold function at θ: T_θ : R → {0,1}, T_θ(α) = 1 iff α > θ. For a class of functions F, T_θ(F) = {T_θ(f) : f ∈ F}.

Theorem 3.9 Consider a real valued function class F having fat shattering function bounded above by the function afat : R → N which is continuous from the right. Fix θ ∈ R. If a learner correctly classifies m independently generated examples z with h = T_θ(f) ∈ T_θ(F) such that er_z(h) = 0 and γ = min_i |f(x_i) − θ|, then with confidence 1 − δ the expected error of h is bounded from above by

ε(m, k, δ) = (2/m)(k log(8em/k) log(32m) + log(8m/δ)),

where k = afat(γ/8).

Proof: The proof will make use of Lemma 3.8. First we will move to a double sample and stratify by k. By the union bound, it thus suffices to show that

P^{2m}(⋃_{k=1}^{2m} J_k) ≤ Σ_{k=1}^{2m} P^{2m}(J_k) < δ/2,

where

J_k = {xy : ∃h = T_θ(f) ∈ T_θ(F), Er_x(h) = 0, k = afat(γ/8), γ = min_i |f(x_i) − θ|, Er_y(h) ≥ m ε(m, k, δ)/2}.

(The largest value of k we need consider is 2m, since we cannot shatter a greater number of points from xy.) It is sufficient if

P^{2m}(J_k) ≤ δ/(4m) = δ'.

Consider F̂ = F̂_θ and note that by Lemma 3.7 the function afat(γ) also bounds fat_{F̂}(γ). The probability distribution on X̂ = X × {0,1} is given by P on X with the second component determined by the target value of the first component. Note that for a point ŷ ∈ ŷ to be misclassified, it must have f̂(ŷ) ≥ θ ≥ max{f̂(x̂) : x̂ ∈ x̂} + γ, so that

J_k ⊆ {x̂ŷ ∈ (X × {0,1})^{2m} : ∃f̂ ∈ F̂, r = max{f̂(x̂) : x̂ ∈ x̂}, γ = θ − r, k = afat(γ/8), |{ŷ ∈ ŷ : f̂(ŷ) ≥ θ}| ≥ m ε(m, k, δ)/2}.

Replacing γ by γ/2 in Lemma 3.6 we obtain P^{2m}(J_k) ≤ δ' for

ε(m, k, δ) = (2/m)(k log(8em/k) log(32m) + log(2/δ')).

With this linking of ε and m, the condition of Lemma 3.8 is satisfied. Appealing to this and noting that the union bound gives P^{2m}(⋃_{k=1}^{2m} J_k) ≤ Σ_{k=1}^{2m} P^{2m}(J_k), we conclude the proof by substituting for δ'.

A related result, that gives bounds on the misclassification probability of thresholded functions in terms of an error estimate involving the margin of the corresponding real-valued functions, is given in [9]. Using this result and bounds on the fat-shattering dimension of sigmoidal neural networks, that paper also gives bounds on the generalization performance of these networks that depend on the size of the parameters but are independent of the number of parameters.
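The bound of Theorem 3.9 is simple to evaluate numerically; the short sketch below does so with logs taken to base 2, per the convention fixed in the introduction, and with arbitrary sample values.

```python
from math import e, log2

def fat_shattering_bound(m, k, delta):
    """Theorem 3.9: confidence 1 - delta bound on the error of a classifier with zero
    training error at margin gamma, where k = afat(gamma / 8)."""
    return (2.0 / m) * (k * log2(8 * e * m / k) * log2(32 * m) + log2(8 * m / delta))

print(fat_shattering_bound(m=100000, k=20, delta=0.01))
```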

4 Large Margin Hyperplanes

We will now consider a particular case of the results in the previous section, applicable to the class of linear threshold functions in Euclidean space. Vapnik and others [46, 48, 14, 16], [18, page 140] have suggested that choosing the maximal margin hyperplane (i.e. the hyperplane which maximises the minimal distance of the points, assuming a correct classification can be made) will improve the generalization of the resulting classifier. They give evidence to indicate that the generalization performance is frequently significantly better than that predicted by the VC dimension of the full class of linear threshold functions. In this section of the paper we will show that indeed a large margin does help in this case, and we will give an explicit bound on the generalization error in terms of the margin achieved on the training sample. We do this by first bounding the appropriate fat-shattering function, and then applying Theorem 3.9. The margin also arises in the proof of the perceptron convergence theorem (see for example [23, pages 61-62]), where an alternate motivation is given for a large margin: noise immunity. The margin occurs even more explicitly in the Winnow algorithms and their variants developed by Littlestone and others [30, 31, 32]. The connection between these two uses has not yet been explored.

Consider a hyperplane defined by (w, θ), where w is a weight vector and θ a threshold value. Let X_0 be a subset of the Euclidean space that does not have a limit point on the hyperplane, so that

min_{x∈X_0} |⟨x, w⟩ + θ| > 0.

We say that the hyperplane is in canonical form with respect to X_0 if

min_{x∈X_0} |⟨x, w⟩ + θ| = 1.

Let ‖·‖ denote the Euclidean norm. The maximal margin hyperplane is obtained by minimising ‖w‖ subject to these constraints. The points in X_0 for which the minimum is attained are called the support vectors of the maximal margin hyperplane. The following theorem is the basis for our argument for the maximal margin analysis.

Theorem 4.1 (Vapnik [48]) Suppose X_0 is a subset of the input space contained in a ball of radius R about some point. Consider the set of hyperplanes in canonical form with respect to X_0 that satisfy ‖w‖ ≤ A, and let F be the class of corresponding linear threshold functions,

f(x; w, θ) = sgn(⟨x, w⟩ + θ).

Then the restriction of F to the points in X_0 has VC dimension bounded by

min{R²A², n} + 1.

Our argument will also be in terms of Theorem 3.9, and to that end we need to bound the fat-shattering dimension of the class of hyperplanes. We do this via an argument concerning the level fat-shattering dimension, defined below.

Definition 4.2 Let F be a set of real valued functions. We say that a set of points X is level γ-shattered by F at level r if it can be γ-shattered when choosing r_x = r for all x ∈ X. The level fat shattering dimension lfat_F of the set F is a function from the positive real numbers to the integers which maps a value γ to the size of the largest level γ-shattered set, if this is finite, or infinity otherwise.

The level fat-shattering dimension is a scale sensitive version of a dimension introduced by Vapnik [46]. The scale sensitive version was first introduced by Alon et al. [1].

Lemma 4.3 Let F be the set of linear functions with unit weight vectors, restricted to points in a ball of radius R,

F = {x ↦ ⟨w, x⟩ + θ : ‖w‖ = 1}.     (1)

Then the level fat shattering function can be bounded from above by

lfat_F(γ) ≤ min{R²/γ², n} + 1.

Proof: If a set of points X = {x_i}_i is to be level γ-shattered there must be a value r such that each dichotomy b can be realised with a weight vector w_b and threshold θ_b such that

⟨w_b, x_i⟩ + θ_b ≥ r + γ if b_i = 1, and ⟨w_b, x_i⟩ + θ_b ≤ r − γ otherwise.

Let γ' = min_{x∈X} |⟨w_b, x⟩ + θ_b − r|. Consider the hyperplane defined by (w'_b, θ'_b) = (w_b/γ', (θ_b − r)/γ'). It is in canonical form with respect to the points X, satisfies ‖w'_b‖ = ‖w_b/γ'‖ = 1/γ', and realises dichotomy b on X. Hence, the set of points X can be shattered by a subset of canonical hyperplanes (w'_b, θ'_b) satisfying ‖w'_b‖ ≤ 1/γ' ≤ 1/γ. The result follows from Theorem 4.1.

Corollary 4.4 Let F be the set, defined in (1), of linear functions with unit weight vectors, restricted to points in a ball of n dimensions of radius R about the origin and with thresholds |θ| ≤ R. The fat shattering function of F can be bounded by

fat_F(γ) ≤ min{9R²/γ², n + 1} + 1.

Proof: Suppose m points x_1, ..., x_m lying in a ball of radius R about the origin are γ-shattered relative to r = (r_1, ..., r_m). Since ‖w‖ = 1, |⟨w, x_i⟩ + θ| ≤ 2R, and so |r_i| ≤ 2R. From each x_i, i = 1, ..., m, we create an extended vector x̃_i := (x_i^1, ..., x_i^n, r_i/√2). Since |r_i| ≤ 2R, ‖x̃_i‖ ≤ √3 R. Let (w_b, θ_b) be the parameter vector of the hyperplane that realizes a dichotomy b ∈ {0,1}^m. Set w̃_b = (w_b^1, ..., w_b^n, −√2). We now show that the points x̃_i, i = 1, ..., m, are level γ-shattered at level 0 by {w̃_b}_{b∈{0,1}^m}. We have that

⟨w̃_b, x̃_i⟩ + θ_b = ⟨w_b, x_i⟩ + θ_b − r_i =: t.

But ⟨w_b, x_i⟩ + θ_b ≥ r_i + γ if b_i = 1, and ⟨w_b, x_i⟩ + θ_b ≤ r_i − γ if b_i = 0. Thus

t ≥ r_i + γ − r_i = γ if b_i = 1, and t ≤ r_i − γ − r_i = −γ if b_i = 0.

Now ‖w̃_b‖ = √3. Set w̃'_b = w̃_b/√3 and x̃'_i = √3 x̃_i. Then ‖w̃'_b‖ = 1 and the points x̃'_i, i = 1, ..., m, are level γ-shattered at level 0 by {w̃'_b}_{b∈{0,1}^m}. Since dim x̃'_i = n + 1 and ‖x̃'_i‖ ≤ √3 · √3 R = 3R, we have by Lemma 4.3 that fat_F(γ) ≤ min{9R²/γ², n + 1} + 1.

Theorem 4.5 Suppose inputs are drawn independently according to a distribution whose support is contained in a ball in R^n centred at the origin, of radius R. If we succeed in correctly classifying m such inputs by a canonical hyperplane with ‖w‖ = 1/γ and with |θ| ≤ R, then with confidence 1 − δ the generalization error will be bounded from above by

ε(m, δ) = (2/m)(k log(8em/k) log(32m) + log(8m/δ)),

where k = ⌊577 R²/γ²⌋.

Proof: Firstly note that we can restrict our consideration to the subclass of F with |θ| ≤ R. If there is more than one point to be γ-shattered, then it is required to achieve a dichotomy with different signs; that is, b is neither all 0s nor all 1s. Since all of the points lie in the ball, to shatter them the hyperplane must intersect the ball. Since ‖w‖ = 1, that means |θ| ≤ R. So although one may achieve a greater margin for the all-zero or all-one dichotomy by choosing a larger value of θ, all of the other dichotomies cannot achieve a larger γ. Thus although the bound may be weak in the special case of an all 0 or all 1 classification on the training set, it will still be true. Hence, we are now in a position to apply Theorem 3.9 with the value of θ given in that theorem taken as 0. Hence,

fat_F(γ/8) ≤ ⌊576 R²/γ² + 1⌋ < ⌊577 R²/γ²⌋,

since γ < R. Substituting into the bound of Theorem 3.9 gives the required bound.

In Section 6 we will give an analogous result as a special case of the more general framework derived in Section 5. Although the sample size bound for that result is weaker (by an additional log(m) factor), it does allow one to cope with the slightly more general situation of estimating the radius of the ball rather than knowing it in advance.

The fact that the bound in Theorem 4.5 does not depend on the dimension of the input space is particularly important in the light of Vapnik's ingenious construction of his support-vector machines [16, 48]. This is a method of implementing quite complex decision rules (such as those defined by polynomials or neural networks) in terms of linear hyperplanes in very many dimensions. The clever part of the technique is the algorithm which can work in a dual space, and which maximizes the margin on a training set. Thus Vapnik's algorithm, along with the bound of Theorem 4.5, should allow good a posteriori bounds on the generalization error in a range of applications.

It is important to note that our explanation of the good performance of maximum margin hyperplanes is different to that given by Vapnik in [48, page 135]. Whilst alluding to the result of Theorem 4.1, the theorem he presents as the explanation is a bound on the expected generalization error in terms of the number of support vectors. A small number of support vectors gives a good bound. One can construct examples in which all four combinations of small/large margin and few/many support vectors occur. Thus neither explanation is the only one. In the terminology of the next section, the margin and (the reciprocal of) the number of support vectors are both luckiness functions, and either could be used to determine bounds on performance.
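Theorem 4.5 turns an observed margin directly into a generalization bound. The sketch below computes the geometric margin 1/‖w‖ of a canonical hyperplane and evaluates the bound; the numerical inputs are arbitrary, and the bound is loose unless m is large relative to R²/γ².

```python
from math import e, log2, floor

def canonical_margin(w):
    """For a hyperplane in canonical form (min_i |<w, x_i> + theta| = 1), the geometric
    margin on the sample is 1 / ||w||."""
    return 1.0 / sum(wi * wi for wi in w) ** 0.5

def maximal_margin_bound(m, R, gamma, delta):
    """Theorem 4.5: bound for correct classification of m points in a ball of radius R
    by a canonical hyperplane with ||w|| = 1/gamma and |theta| <= R."""
    k = floor(577.0 * R * R / (gamma * gamma))
    return (2.0 / m) * (k * log2(8 * e * m / k) * log2(32 * m) + log2(8 * m / delta))

gamma = canonical_margin((0.5, 0.5))            # ||w|| = 1/sqrt(2), so the margin is sqrt(2)
print(maximal_margin_bound(m=10 ** 8, R=5.0, gamma=gamma, delta=0.01))
```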

5 Luckiness: A General Framework for Decomposing Classes

The standard PAC analysis gives bounds on generalization error that are uniform over the hypothesis class. Decomposing the hypothesis class, as described in Section 2, allows us to bias our generalization error bounds in favour of certain target functions and distributions: those for which some hypothesis low in the hierarchy is an accurate approximation. The results of Section 4 show that it is possible to decompose the hypothesis class on the basis of the observed data in some cases: there we did it in terms of the margin attained. In this section, we introduce a more general framework which subsumes the standard PAC model and the framework described in Section 2, and can recover (in a slightly weaker form) the results of Section 4 as a special case. This more general decomposition of the hypothesis class based on the sample allows us to bias our generalization error bounds in favour of more general classes of target functions and distributions, which might correspond to more realistic assumptions about practical learning problems.

It seems that in order to allow the decomposition of the hypothesis class to depend on the sample, we need to make better use of the information provided by the sample. Both the standard PAC analysis and structural risk minimisation with a fixed decomposition of the hypothesis class effectively discard the training examples, and only make use of the function Er_z defined on the hypothesis class that is induced by the training examples. The additional information we exploit in the case of sample-based decompositions of the hypothesis class is encapsulated in a luckiness function. The main idea is to fix in advance some assumption about the target function and distribution, and encode this assumption in a real-valued function defined on the space of training samples and hypotheses. The value of the function indicates the extent to which the assumption is satisfied for that sample and hypothesis. We call this mapping a luckiness function, since it reflects how fortunate we are that our assumption is satisfied. That is, we make use of a function

L : X^m × H → R⁺,

which measures the luckiness of a particular hypothesis with respect to the training examples. Sometimes it is convenient to express this relationship in an inverted way, as an unluckiness function,

U : X^m × H → R⁺.

It turns out that only the ordering that the luckiness or unluckiness functions impose on hypotheses is important. We define the level of a function h ∈ H relative to L and x by the function

ℓ(x, h) = |{b ∈ {0,1}^m : ∃g ∈ H, g(x) = b, L(x, g) ≥ L(x, h)}|

or

ℓ(x, h) = |{b ∈ {0,1}^m : ∃g ∈ H, g(x) = b, U(x, g) ≤ U(x, h)}|.

Whether ℓ(x, h) is defined in terms of L or U is a matter of convenience; the quantity ℓ(x, h) itself plays the central role in what follows. If x, y ∈ X^m, we denote by xy their concatenation (x_1, ..., x_m, y_1, ..., y_m).
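For a finite hypothesis class, the level ℓ(x, h) can be computed directly from its definition. The stump class and the unluckiness U(x, h) = |c| used below are purely illustrative assumptions; they only serve to show how the ordering induced by (un)luckiness determines ℓ.

```python
def level(x, h, hypotheses, unluckiness):
    """l(x, h): the number of dichotomies of the sample x realised by hypotheses g in H
    with U(x, g) <= U(x, h), i.e. by hypotheses at least as lucky as h."""
    u_h = unluckiness(x, h)
    return len({tuple(g["f"](xi) for xi in x)
                for g in hypotheses if unluckiness(x, g) <= u_h})

# Illustrative assumption: threshold functions t -> 1[t > c] on the line, with the toy
# unluckiness U(x, h) = |c| (small-magnitude thresholds count as lucky).
hypotheses = [{"c": c, "f": (lambda t, c=c: int(t > c))} for c in (-2.0, -1.0, 0.0, 1.0, 2.0)]
unluck = lambda x, g: abs(g["c"])
x = (-1.5, -0.5, 0.5, 1.5)
print([level(x, h, hypotheses, unluck) for h in hypotheses])   # [5, 3, 1, 3, 5]
```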

5.1 Examples

Example 5.1 Consider the hierarchy of classes introduced in Section 2 and define

U(x, h) = min{d : h ∈ H_d}.

Then it follows from Sauer's lemma that for any x we can bound ℓ(x, h) by

ℓ(x, h) ≤ (em/d)^d,

where d = U(x, h). Notice also that for any y ∈ X^m,

ℓ(xy, h) ≤ (2em/d)^d.

The last observation is something that will prove useful later when we investigate how we can use luckiness on a sample to infer luckiness on a subsequent sample.

We show in Section 6 that the hyperplane margin of Section 4 is a luckiness function which satisfies the technical restrictions we introduce below. We do this in fact in terms of the following unluckiness function, defined formally here for convenience later on.

Definition 5.2 If h is a linear threshold function with separating hyperplane defined by (w, θ), and (w, θ) is in canonical form with respect to an m-sample x, then define

U(x, h) = max_{1≤i≤m} ‖x_i‖² ‖w‖².

Finally, we give a separate unluckiness function for the maximal margin hyperplane example. In practical experiments it is frequently observed that the number of support vectors is significantly smaller than the full training sample. Vapnik [48, Theorem 5.2] gives a bound on the expected generalization error in terms of the number of support vectors, as well as giving examples of classifiers [48, Table 5.2] for which the number of support vectors was very much less than the number of training examples. We will call this unluckiness function the support vectors unluckiness function.

Definition 5.3 If h is a linear threshold function with separating hyperplane defined by (w, θ), and (w, θ) is the maximal margin hyperplane in canonical form with respect to an m-sample x, then define

U(x, h) = |{x ∈ x : |⟨x, w⟩ + θ| = 1}|,

that is, U is the number of support vectors of the hyperplane.
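Both unluckiness functions just defined are straightforward to evaluate once a canonical hyperplane (w, θ) is at hand. The sample and hyperplane below are hypothetical, chosen only so that the canonical-form condition min_i |⟨x_i, w⟩ + θ| = 1 holds.

```python
def canonical_unluckiness(x, w):
    """Definition 5.2: U(x, h) = max_i ||x_i||^2 * ||w||^2.  Since the margin of a canonical
    hyperplane is 1/||w||, this equals R^2 / gamma^2 measured on the sample."""
    sq = lambda v: sum(vi * vi for vi in v)
    return max(sq(xi) for xi in x) * sq(w)

def support_vector_unluckiness(x, w, theta, tol=1e-9):
    """Definition 5.3: the number of support vectors, i.e. sample points with
    |<x, w> + theta| = 1 (a tolerance is used for floating point comparison)."""
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    return sum(abs(abs(dot(xi, w) + theta) - 1.0) <= tol for xi in x)

x = [(1.0, 1.0), (2.0, 0.0), (-1.0, -1.0), (-2.0, 0.0)]      # hypothetical sample
w, theta = (0.5, 0.5), 0.0                                   # canonical for this sample
print(canonical_unluckiness(x, w), support_vector_unluckiness(x, w, theta))
```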

5.2 Probable Smoothness of Luckiness Functions

We now introduce a technical restriction on luckiness functions required for our theorem.

Definition 5.4 An η-subsequence of a vector x is a vector x' obtained from x by deleting a fraction of at most η of the coordinates. We will also write x' ⊆_η x. For a partitioned vector xy, we write x'y' ⊆_η xy. A luckiness function L(x, h) defined on a function class H is probably smooth with respect to functions φ(m, L, δ) and η(m, L, δ) if, for all targets t in H and for every distribution P,

P^{2m}{xy : ∃h ∈ H, Er_x(h) = 0, ∀x'y' ⊆_η xy, ℓ(x'y', h) > φ(m, L(x, h), δ)} ≤ δ,

where η = η(m, L(x, h), δ). The definition for probably smooth unluckiness is identical except that L's are replaced by U's.

The intuition behind this rather arcane definition is that it captures when the luckiness can be estimated from the first half of the sample with high confidence. In particular, we need to ensure that few dichotomies are luckier than h on the double sample. That is, for a probably smooth luckiness function, if an hypothesis h has luckiness L on the first m points, we know that, with high confidence, for most (at least a proportion 1 − η(m, L, δ)) of the points in a double sample, the growth function for the class of functions that are at least as lucky as h is small (no more than φ(m, L, δ)).

Theorem 5.5 Suppose p_i, i = 1, ..., 2m, are positive numbers satisfying Σ_{i=1}^{2m} p_i = 1, L is a luckiness function for a function class H that is probably smooth with respect to functions φ and η, m ∈ N and 0 < δ < 1/2. For any target function t ∈ H and any distribution P, with probability 1 − δ over m independent examples x chosen according to P, if for any i ∈ N a learner finds an hypothesis h in H with Er_x(h) = 0 and φ(m, L(x, h), δ p_i/4) ≤ 2^{i+1}, then the generalization error of h satisfies er_P(h) ≤ ε(m, i, δ), where

ε(m, i, δ) = (2/m)(i + 1 + log(4/(δ p_i))) + 4 η(m, L(x, h), δ p_i/4) log(4m).

Proof: By Lemma 3.8,

P^m{x : ∃h ∈ H, ∃i ∈ N, Er_x(h) = 0, φ(m, L(x, h), δ p_i/4) ≤ 2^{i+1}, er_P(h) > ε(m, i, δ)}
≤ 2 P^{2m}{xy : ∃h ∈ H, ∃i ∈ N, Er_x(h) = 0, φ(m, L(x, h), δ p_i/4) ≤ 2^{i+1}, Er_y(h) > m ε(m, i, δ)/2},

provided m ≥ 2/ε(m, i, δ), which follows from the definition of ε(m, i, δ) and the fact that δ < 1/2. Hence it suffices to show that P^{2m}(J_i) ≤ δ_i = p_i δ/2 for each i ∈ N, where J_i is the event

{xy : ∃h ∈ H, Er_x(h) = 0, φ(m, L(x, h), δ p_i/4) ≤ 2^{i+1}, Er_y(h) ≥ m ε(m, i, δ)/2}.

Let S be the event

{xy : ∃h ∈ H, Er_x(h) = 0, ∀x'y' ⊆_η xy, ℓ(x'y', h) > φ(m, L(x, h), δ_i/2)},

with η = η(m, L, δ_i/2). It follows that

P^{2m}(J_i) = P^{2m}(J_i ∩ S) + P^{2m}(J_i ∩ S̄) ≤ δ_i/2 + P^{2m}(J_i ∩ S̄).

It suffices then to show that P^{2m}(J_i ∩ S̄) ≤ δ_i/2. But J_i ∩ S̄ is a subset of

R = {xy : ∃h ∈ H, Er_x(h) = 0, ∃x'y' ⊆_η xy, ℓ(x'y', h) ≤ 2^{i+1}, Er_{y'}(h) ≥ m ε(m, i, δ)/2 − (|y| − |y'|)},

where |y'| denotes the length of the sequence y'. Now, if we consider the uniform distribution U on the group of permutations σ on {1, ..., 2m} that swap elements i and i + m, we have

P^{2m}(R) ≤ sup_{xy} U{σ : σ(xy) ∈ R},

where σ(z) = (z_{σ(1)}, ..., z_{σ(2m)}) for z ∈ X^{2m}. Fix xy ∈ X^{2m}. For a subsequence x'y' ⊆_η xy, we let σ(x'y') denote the corresponding subsequence of the permuted version of xy (and similarly for σ(x') and σ(y')). Then

U{σ : σ(xy) ∈ R}
≤ U{σ : ∃x'y' ⊆_η xy, ∃h ∈ H, ℓ(σ(x'y'), h) ≤ 2^{i+1}, Er_{σ(x')}(h) = 0, Er_{σ(y')}(h) ≥ m ε(m, i, δ)/2 − (|y| − |y'|)}
≤ Σ_{x'y' ⊆_η xy} U{σ : ∃h ∈ H, ℓ(σ(x'y'), h) ≤ 2^{i+1}, Er_{σ(x')}(h) = 0, Er_{σ(y')}(h) ≥ m ε(m, i, δ)/2 − (|y| − |y'|)}.

For a fixed subsequence x'y' ⊆_η xy, define the event inside the last sum as A. We can partition the group of permutations into a number of equivalence classes, so that, for all i, within each class all permutations map i to a fixed value unless x'y' contains both x_i and y_i. Clearly, all equivalence classes have equal probability, so we have

U(A) = Σ_C Pr(A|C) Pr(C) ≤ sup_C Pr(A|C),

where the sum and supremum are over equivalence classes C. But within an equivalence class, σ(x'y') is a permutation of x'y', so we can write

Pr(A|C) = Pr(∃h ∈ H, ℓ(σ(x'y'), h) ≤ 2^{i+1}, Er_{σ(x')}(h) = 0, Er_{σ(y')}(h) ≥ m ε(m, i, δ)/2 − (|y| − |y'|) | C)
≤ sup_{σ∈C} |H_{|σ(x'y')}| sup_h Pr(Er_{σ(x')}(h) = 0, Er_{σ(y')}(h) ≥ m ε(m, i, δ)/2 − (|y| − |y'|) | C),     (2)

where the second supremum is over the subset of H for which ℓ(σ(x'y'), h) ≤ 2^{i+1}. Clearly, |H_{|σ(x'y')}| ≤ 2^{i+1}, and the probability in (2) is no more than 2^{−m ε(m, i, δ)/2 + 2ηm}.

Combining these results, we have

P^{2m}(J_i ∩ S̄) ≤ (2m choose 2ηm) 2^{i+1} 2^{−m ε(m, i, δ)/2 + 2ηm},

and this is no more than δ_i/2 = p_i δ/4 if

ε(m, i, δ) ≥ (2/m)(2ηm log(2m) + i + 1 + 2ηm + log(4/(p_i δ))).

The theorem follows.

6 Examples of Probably Smooth Luckiness Functions

In this section, we consider four examples of luckiness functions and show that they are probably smooth. The first example (Example 5.1) is the simplest; in this case luckiness depends only on the hypothesis h and is independent of the examples x. In the second example, luckiness depends only on the examples, and is independent of the hypothesis. The third example allows us to predict the generalization performance of the maximal margin classifier. In this case, luckiness clearly depends on both the examples and the hypothesis. (This is the only example we present here where the luckiness function is both a function of the data and the hypothesis.) The fourth example concerns the VC dimension of a class of functions when restricted to the particular sample available.

First Example If we consider Example 5.1, the unluckiness function is clearly probably smooth if we choose φ(m, U(x, h), δ) = (2em/U)^U and η(m, U, δ) = 0 for all m and δ. The bound on generalization error that we obtain from Theorem 5.5 is almost identical to that given in Theorem 2.1.
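Instantiating Theorem 5.5 with this first example reproduces essentially the arithmetic of Theorem 2.1, and it is easy to evaluate. The sketch below computes the bound of Theorem 5.5 as stated above; the prior p_i = 2^(-i) and the other numerical choices are illustrative assumptions (in particular, renormalising the p_i over i = 1, ..., 2m is ignored here).

```python
from math import ceil, e, log2

def luckiness_bound(m, i, delta, p_i, eta):
    """Theorem 5.5 (as stated above): bound once phi(m, L(x,h), delta*p_i/4) <= 2**(i+1)
    has been verified; eta is the value eta(m, L(x,h), delta*p_i/4)."""
    return (2.0 / m) * (i + 1 + log2(4.0 / (delta * p_i))) + 4.0 * eta * log2(4 * m)

# First example: phi(m, d, delta) = (2em/d)^d and eta = 0, so taking
# i = ceil(d * log2(2em/d)) guarantees phi <= 2**(i+1).
m, d, delta = 10000, 10, 0.01
i = ceil(d * log2(2 * e * m / d))
print(luckiness_bound(m, i, delta, p_i=2.0 ** (-i), eta=0.0))
```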

Second Example The second example we consider involves examples lying on hyperplanes.

Definition 6.1 Define the unluckiness function U(x, h) for a linear threshold function h as

U(x, h) = dim span{x},

the dimension of the vector space spanned by the vectors x.

Proposition 6.2 Let H be the class of linear threshold functions defined on R^d. The unluckiness function of Definition 6.1 is probably smooth with respect to φ(m, U, δ) = (2em/U)^U and

η(m, U, δ) = (4/m)(U ln(2em/U) + ln(4d/δ)).

Proof: The recognition of a k-dimensional subspace is a learning problem for the indicator functions H_k of the subspaces. These have VC dimension k. Hence, applying the hierarchical approach of Theorem 2.1 taking p_k = 1/d, we obtain the given error bound for the number of examples in the second half of the sequence lying outside the subspace. Hence, with probability 1 − δ there will be a (1 − η)-subsequence of points all lying in the given subspace. For this sequence the growth function is bounded by φ(m, U, δ).

The above example will be useful if we have a distribution which is highly concentrated on the subspace with only a small probability of points lying outside it. We conjecture that it is possible to relax the assumption that the probability distribution is concentrated exactly on the subspace, to take advantage of a situation where it is concentrated around the subspace and the classifications are compatible with a perpendicular projection onto the space. This will also make use of both the data and the classification to decide the luckiness.

Third Example We are now in a position to state the result concerning maximal margin hyperplanes.

Proposition 6.3 The unluckiness function of Definition 5.2 is probably smooth with φ(m, U, δ) = (2em/(9U))^{9U} and

η(m, U, δ) = (1/m)(k log(8em/k) log(32m) + log(4m + 2) + 2 log(2/δ)),

where k = ⌊1297U⌋.

Proof: By the definition of the unluckiness function U, we have that the maximal margin hyperplane has margin γ satisfying U = R²/γ², where

R = max_{1≤i≤m} ‖x_i‖.

The proof works by allowing two sets of points to be excluded from the second half of the sample, hence making up the value of η. By ignoring these points, with probability 1 − δ the remaining points will be in the ball of radius R about the origin and will be correctly classified by the maximal margin hyperplane with a margin of γ/3. Provided this is the case, the function φ(m, U, δ) gives a bound on the growth function on the double sample of hyperplanes with larger margins. Hence, it remains to show that with probability 1 − δ there exists a fraction of η(m, U, δ) points of the double sample whose removal leaves a subsequence of points satisfying the above conditions.

First consider the class H = {f_a | a ∈ R⁺}, where

f_a(x) = 1 if ‖x‖ ≤ a, and 0 otherwise.

The class has VC dimension 1 and so, by the permutation argument, with probability 1 − δ/2 at most a fraction η_1 of the second half of the sample are outside the ball B centred at the origin containing x, with radius R, where

η_1 = (1/m)(log(2m + 1) + log(2/δ)),

since the growth function B_H(m) = m + 1. We now consider the permutation argument applied to the points of the double sample contained in B to estimate how many are closer to the hyperplane than γ/3 or are incorrectly classified. This involves an application of Lemma 3.6 with γ/3 substituted for γ, and using the folding argument introduced just before that Lemma. We have by Corollary 4.4 that

fat_F(γ) ≤ min{9R²/γ², n + 1} + 1,

where F is the set of linear functions with unit weight vector restricted to points in a ball of radius R about the origin. Hence, with probability 1 − δ/2 at most a fraction η_2 of the second half of the sample that are in B are either not correctly classified or within a margin of γ/3 of the hyperplane, where

η_2 = (1/m)(k log(8em/k) log(32m) + log(4/δ)),

for k = ⌊1297 R²/γ²⌋ = ⌊1297U⌋. The result follows by adding the numbers of excluded points η_1 m and η_2 m and expressing the result as a fraction of the double sample, as required.

Combining the results of Theorem 5.5 and Proposition 6.3 gives the following corollary.

Corollary 6.4 Suppose p_i, for i = 1, ..., 2m, are positive numbers satisfying Σ_{i=1}^{2m} p_i = 1. Suppose 0 < δ < 1/2, t ∈ H, and P is a probability distribution on X. Then with probability 1 − δ over m independent examples x chosen according to P, if a learner finds an hypothesis h that satisfies Er_x(h) = 0, then the generalization error of h is no more than

ε(m, U, δ) = (2/m)(9U log(2em/(9U)) + 1 + log(4/(δ p_i))) + (4/m)(⌊1297U⌋ log(8em/⌊1297U⌋) log(32m) + log(16m + 8) + 2 log(4/(δ p_i))) log(4m),

where U = U(x, h) for the unluckiness function of Definition 5.2 and i is the smallest integer such that (2em/(9U))^{9U} ≤ 2^{i+1}.

If we compare this corollary with Theorem 4.5, there is an extra log(m) factor that arises from the fact that we have to consider all possible permutations of the omitted subsequence in the general proof, whereas that is not necessary in the direct argument based on fat-shattering. The additional generality obtained here is that the support of the probability distribution does not need to be known, and even if it is, we may derive advantage from observing points with small norms, hence giving a better value of U = R²/γ² than would be obtained in Theorem 4.5 where the a priori bound on R must be used.

Vapnik has used precisely the expression for this unluckiness function (given in Definition 5.2) as an estimate of the effective VC dimension of the Support Vector Machine [48, p. 139]. The functional obtained is used to locate the best suited complexity class among different polynomial kernel functions in the Support Vector Machine [48, Table 5.6]. The result above shows that this strategy is well-founded, by giving a bound on the generalization error in terms of this quantity.

It is interesting to note that the support vectors unluckiness function of Definition 5.3 relates to the same classifiers, but is one which for a given sample defines a different ordering on the functions in the class, in the sense that a large margin can occur with a large number of support vectors, while a small margin can be forced by a small number of support vectors. We will omit the proof of the probable smoothness of the support vectors unluckiness function, since a more direct bound on the generalization error can be obtained using the results of Floyd and Warmuth [19]. Since the set of support vectors is a compression scheme, Theorem 5.1 of [19] can be rephrased as follows. Let MMH be the function that returns the maximal margin hyperplane consistent with a labelled sample. Note that applying the function MMH to the labelled support vectors returns the maximal margin hyperplane of which they are the support vectors.

Theorem 6.5 (Littlestone and Warmuth [33]) Let D be any probability distribution on a domain X, and c be any concept on X. Then the probability that m examples drawn independently at random according to D contain a subset of at most d examples that map via MMH to a hypothesis that is both consistent with all m examples and has error larger than ε is at most

Σ_{i=0}^{d} (m choose i) (1 − ε)^{m−i}.

The theorem implies that the generalization error of a maximal margin hyperplane with d support vectors among a sample of size m can with confidence 1 − δ be bounded by

(1/(m − d))(d log(em/d) + log(m/δ)),

where we have allowed different numbers of support vectors by applying standard SRM to the bounds for different d. Note that the value d, that is the unluckiness of Definition 5.3, plays the role of the VC dimension in the bound. Using a similar technique to that of the above theorem, it is possible to show that the support vectors unluckiness function is indeed probably smooth with respect to

φ(m, d, δ) = (2em/d)^d and η(m, d, δ) = (1/(2m))(d log(2em/d) + log(1/δ)).

However, the resulting bound on the generalization error involves an extra log factor.
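The compression-scheme bound quoted above is the simplest of the bounds in this section to evaluate; the short sketch below is a calculator for it with arbitrary numerical inputs (logs to base 2, per the paper's convention).

```python
from math import e, log2

def compression_bound(m, d, delta):
    """The bound derived from Theorem 6.5: generalization error of a maximal margin
    hyperplane with d support vectors among m examples, with confidence 1 - delta."""
    return (1.0 / (m - d)) * (d * log2(e * m / d) + log2(m / delta))

print(compression_bound(m=10000, d=50, delta=0.01))
```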


More information

Quantum Mechanics in Three Dimensions

Quantum Mechanics in Three Dimensions Physics 342 Lecture 20 Quantum Mechanics in Three Dimensions Lecture 20 Physics 342 Quantum Mechanics I Monay, March 24th, 2008 We begin our spherical solutions with the simplest possible case zero potential.

More information

arxiv: v4 [cs.ds] 7 Mar 2014

arxiv: v4 [cs.ds] 7 Mar 2014 Analysis of Agglomerative Clustering Marcel R. Ackermann Johannes Blömer Daniel Kuntze Christian Sohler arxiv:101.697v [cs.ds] 7 Mar 01 Abstract The iameter k-clustering problem is the problem of partitioning

More information

Math 115 Section 018 Course Note

Math 115 Section 018 Course Note Course Note 1 General Functions Definition 1.1. A function is a rule that takes certain numbers as inputs an assigns to each a efinite output number. The set of all input numbers is calle the omain of

More information

A simple model for the small-strain behaviour of soils

A simple model for the small-strain behaviour of soils A simple moel for the small-strain behaviour of soils José Jorge Naer Department of Structural an Geotechnical ngineering, Polytechnic School, University of São Paulo 05508-900, São Paulo, Brazil, e-mail:

More information

PDE Notes, Lecture #11

PDE Notes, Lecture #11 PDE Notes, Lecture # from Professor Jalal Shatah s Lectures Febuary 9th, 2009 Sobolev Spaces Recall that for u L loc we can efine the weak erivative Du by Du, φ := udφ φ C0 If v L loc such that Du, φ =

More information

Influence of weight initialization on multilayer perceptron performance

Influence of weight initialization on multilayer perceptron performance Influence of weight initialization on multilayer perceptron performance M. Karouia (1,2) T. Denœux (1) R. Lengellé (1) (1) Université e Compiègne U.R.A. CNRS 817 Heuiasyc BP 649 - F-66 Compiègne ceex -

More information

Math 342 Partial Differential Equations «Viktor Grigoryan

Math 342 Partial Differential Equations «Viktor Grigoryan Math 342 Partial Differential Equations «Viktor Grigoryan 6 Wave equation: solution In this lecture we will solve the wave equation on the entire real line x R. This correspons to a string of infinite

More information

Calculus and optimization

Calculus and optimization Calculus an optimization These notes essentially correspon to mathematical appenix 2 in the text. 1 Functions of a single variable Now that we have e ne functions we turn our attention to calculus. A function

More information

Solutions to Math 41 Second Exam November 4, 2010

Solutions to Math 41 Second Exam November 4, 2010 Solutions to Math 41 Secon Exam November 4, 2010 1. (13 points) Differentiate, using the metho of your choice. (a) p(t) = ln(sec t + tan t) + log 2 (2 + t) (4 points) Using the rule for the erivative of

More information

Math 1B, lecture 8: Integration by parts

Math 1B, lecture 8: Integration by parts Math B, lecture 8: Integration by parts Nathan Pflueger 23 September 2 Introuction Integration by parts, similarly to integration by substitution, reverses a well-known technique of ifferentiation an explores

More information

Differentiation ( , 9.5)

Differentiation ( , 9.5) Chapter 2 Differentiation (8.1 8.3, 9.5) 2.1 Rate of Change (8.2.1 5) Recall that the equation of a straight line can be written as y = mx + c, where m is the slope or graient of the line, an c is the

More information

Lecture 5. Symmetric Shearer s Lemma

Lecture 5. Symmetric Shearer s Lemma Stanfor University Spring 208 Math 233: Non-constructive methos in combinatorics Instructor: Jan Vonrák Lecture ate: January 23, 208 Original scribe: Erik Bates Lecture 5 Symmetric Shearer s Lemma Here

More information

Topic 7: Convergence of Random Variables

Topic 7: Convergence of Random Variables Topic 7: Convergence of Ranom Variables Course 003, 2016 Page 0 The Inference Problem So far, our starting point has been a given probability space (S, F, P). We now look at how to generate information

More information

JUST THE MATHS UNIT NUMBER DIFFERENTIATION 2 (Rates of change) A.J.Hobson

JUST THE MATHS UNIT NUMBER DIFFERENTIATION 2 (Rates of change) A.J.Hobson JUST THE MATHS UNIT NUMBER 10.2 DIFFERENTIATION 2 (Rates of change) by A.J.Hobson 10.2.1 Introuction 10.2.2 Average rates of change 10.2.3 Instantaneous rates of change 10.2.4 Derivatives 10.2.5 Exercises

More information

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics

This module is part of the. Memobust Handbook. on Methodology of Modern Business Statistics This moule is part of the Memobust Hanbook on Methoology of Moern Business Statistics 26 March 2014 Metho: Balance Sampling for Multi-Way Stratification Contents General section... 3 1. Summary... 3 2.

More information

CHAPTER 1 : DIFFERENTIABLE MANIFOLDS. 1.1 The definition of a differentiable manifold

CHAPTER 1 : DIFFERENTIABLE MANIFOLDS. 1.1 The definition of a differentiable manifold CHAPTER 1 : DIFFERENTIABLE MANIFOLDS 1.1 The efinition of a ifferentiable manifol Let M be a topological space. This means that we have a family Ω of open sets efine on M. These satisfy (1), M Ω (2) the

More information

Stopping-Set Enumerator Approximations for Finite-Length Protograph LDPC Codes

Stopping-Set Enumerator Approximations for Finite-Length Protograph LDPC Codes Stopping-Set Enumerator Approximations for Finite-Length Protograph LDPC Coes Kaiann Fu an Achilleas Anastasopoulos Electrical Engineering an Computer Science Dept. University of Michigan, Ann Arbor, MI

More information

A Course in Machine Learning

A Course in Machine Learning A Course in Machine Learning Hal Daumé III 12 EFFICIENT LEARNING So far, our focus has been on moels of learning an basic algorithms for those moels. We have not place much emphasis on how to learn quickly.

More information

Polynomial Inclusion Functions

Polynomial Inclusion Functions Polynomial Inclusion Functions E. e Weert, E. van Kampen, Q. P. Chu, an J. A. Muler Delft University of Technology, Faculty of Aerospace Engineering, Control an Simulation Division E.eWeert@TUDelft.nl

More information

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes

Leaving Randomness to Nature: d-dimensional Product Codes through the lens of Generalized-LDPC codes Leaving Ranomness to Nature: -Dimensional Prouct Coes through the lens of Generalize-LDPC coes Tavor Baharav, Kannan Ramchanran Dept. of Electrical Engineering an Computer Sciences, U.C. Berkeley {tavorb,

More information

The Exact Form and General Integrating Factors

The Exact Form and General Integrating Factors 7 The Exact Form an General Integrating Factors In the previous chapters, we ve seen how separable an linear ifferential equations can be solve using methos for converting them to forms that can be easily

More information

Agmon Kolmogorov Inequalities on l 2 (Z d )

Agmon Kolmogorov Inequalities on l 2 (Z d ) Journal of Mathematics Research; Vol. 6, No. ; 04 ISSN 96-9795 E-ISSN 96-9809 Publishe by Canaian Center of Science an Eucation Agmon Kolmogorov Inequalities on l (Z ) Arman Sahovic Mathematics Department,

More information

Table of Common Derivatives By David Abraham

Table of Common Derivatives By David Abraham Prouct an Quotient Rules: Table of Common Derivatives By Davi Abraham [ f ( g( ] = [ f ( ] g( + f ( [ g( ] f ( = g( [ f ( ] g( g( f ( [ g( ] Trigonometric Functions: sin( = cos( cos( = sin( tan( = sec

More information

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x)

The derivative of a function f(x) is another function, defined in terms of a limiting expression: f(x + δx) f(x) Y. D. Chong (2016) MH2801: Complex Methos for the Sciences 1. Derivatives The erivative of a function f(x) is another function, efine in terms of a limiting expression: f (x) f (x) lim x δx 0 f(x + δx)

More information

A Sketch of Menshikov s Theorem

A Sketch of Menshikov s Theorem A Sketch of Menshikov s Theorem Thomas Bao March 14, 2010 Abstract Let Λ be an infinite, locally finite oriente multi-graph with C Λ finite an strongly connecte, an let p

More information

QF101: Quantitative Finance September 5, Week 3: Derivatives. Facilitator: Christopher Ting AY 2017/2018. f ( x + ) f(x) f(x) = lim

QF101: Quantitative Finance September 5, Week 3: Derivatives. Facilitator: Christopher Ting AY 2017/2018. f ( x + ) f(x) f(x) = lim QF101: Quantitative Finance September 5, 2017 Week 3: Derivatives Facilitator: Christopher Ting AY 2017/2018 I recoil with ismay an horror at this lamentable plague of functions which o not have erivatives.

More information

Chapter 6: Energy-Momentum Tensors

Chapter 6: Energy-Momentum Tensors 49 Chapter 6: Energy-Momentum Tensors This chapter outlines the general theory of energy an momentum conservation in terms of energy-momentum tensors, then applies these ieas to the case of Bohm's moel.

More information

Lecture 6: Calculus. In Song Kim. September 7, 2011

Lecture 6: Calculus. In Song Kim. September 7, 2011 Lecture 6: Calculus In Song Kim September 7, 20 Introuction to Differential Calculus In our previous lecture we came up with several ways to analyze functions. We saw previously that the slope of a linear

More information

Qubit channels that achieve capacity with two states

Qubit channels that achieve capacity with two states Qubit channels that achieve capacity with two states Dominic W. Berry Department of Physics, The University of Queenslan, Brisbane, Queenslan 4072, Australia Receive 22 December 2004; publishe 22 March

More information

Discrete Mathematics

Discrete Mathematics Discrete Mathematics 309 (009) 86 869 Contents lists available at ScienceDirect Discrete Mathematics journal homepage: wwwelseviercom/locate/isc Profile vectors in the lattice of subspaces Dániel Gerbner

More information

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION

LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION The Annals of Statistics 1997, Vol. 25, No. 6, 2313 2327 LATTICE-BASED D-OPTIMUM DESIGN FOR FOURIER REGRESSION By Eva Riccomagno, 1 Rainer Schwabe 2 an Henry P. Wynn 1 University of Warwick, Technische

More information

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties

Flexible High-Dimensional Classification Machines and Their Asymptotic Properties Journal of Machine Learning Research 16 (2015) 1547-1572 Submitte 1/14; Revise 9/14; Publishe 8/15 Flexible High-Dimensional Classification Machines an Their Asymptotic Properties Xingye Qiao Department

More information

Lower bounds on Locality Sensitive Hashing

Lower bounds on Locality Sensitive Hashing Lower bouns on Locality Sensitive Hashing Rajeev Motwani Assaf Naor Rina Panigrahy Abstract Given a metric space (X, X ), c 1, r > 0, an p, q [0, 1], a istribution over mappings H : X N is calle a (r,

More information

On colour-blind distinguishing colour pallets in regular graphs

On colour-blind distinguishing colour pallets in regular graphs J Comb Optim (2014 28:348 357 DOI 10.1007/s10878-012-9556-x On colour-blin istinguishing colour pallets in regular graphs Jakub Przybyło Publishe online: 25 October 2012 The Author(s 2012. This article

More information

SYNCHRONOUS SEQUENTIAL CIRCUITS

SYNCHRONOUS SEQUENTIAL CIRCUITS CHAPTER SYNCHRONOUS SEUENTIAL CIRCUITS Registers an counters, two very common synchronous sequential circuits, are introuce in this chapter. Register is a igital circuit for storing information. Contents

More information

Make graph of g by adding c to the y-values. on the graph of f by c. multiplying the y-values. even-degree polynomial. graph goes up on both sides

Make graph of g by adding c to the y-values. on the graph of f by c. multiplying the y-values. even-degree polynomial. graph goes up on both sides Reference 1: Transformations of Graphs an En Behavior of Polynomial Graphs Transformations of graphs aitive constant constant on the outsie g(x) = + c Make graph of g by aing c to the y-values on the graph

More information

REAL ANALYSIS I HOMEWORK 5

REAL ANALYSIS I HOMEWORK 5 REAL ANALYSIS I HOMEWORK 5 CİHAN BAHRAN The questions are from Stein an Shakarchi s text, Chapter 3. 1. Suppose ϕ is an integrable function on R with R ϕ(x)x = 1. Let K δ(x) = δ ϕ(x/δ), δ > 0. (a) Prove

More information

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences.

. Using a multinomial model gives us the following equation for P d. , with respect to same length term sequences. S 63 Lecture 8 2/2/26 Lecturer Lillian Lee Scribes Peter Babinski, Davi Lin Basic Language Moeling Approach I. Special ase of LM-base Approach a. Recap of Formulas an Terms b. Fixing θ? c. About that Multinomial

More information

Introduction to variational calculus: Lecture notes 1

Introduction to variational calculus: Lecture notes 1 October 10, 2006 Introuction to variational calculus: Lecture notes 1 Ewin Langmann Mathematical Physics, KTH Physics, AlbaNova, SE-106 91 Stockholm, Sween Abstract I give an informal summary of variational

More information

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation

Thermal conductivity of graded composites: Numerical simulations and an effective medium approximation JOURNAL OF MATERIALS SCIENCE 34 (999)5497 5503 Thermal conuctivity of grae composites: Numerical simulations an an effective meium approximation P. M. HUI Department of Physics, The Chinese University

More information

Linear and quadratic approximation

Linear and quadratic approximation Linear an quaratic approximation November 11, 2013 Definition: Suppose f is a function that is ifferentiable on an interval I containing the point a. The linear approximation to f at a is the linear function

More information

Introduction to the Vlasov-Poisson system

Introduction to the Vlasov-Poisson system Introuction to the Vlasov-Poisson system Simone Calogero 1 The Vlasov equation Consier a particle with mass m > 0. Let x(t) R 3 enote the position of the particle at time t R an v(t) = ẋ(t) = x(t)/t its

More information

Equilibrium in Queues Under Unknown Service Times and Service Value

Equilibrium in Queues Under Unknown Service Times and Service Value University of Pennsylvania ScholarlyCommons Finance Papers Wharton Faculty Research 1-2014 Equilibrium in Queues Uner Unknown Service Times an Service Value Laurens Debo Senthil K. Veeraraghavan University

More information

Generalized Tractability for Multivariate Problems

Generalized Tractability for Multivariate Problems Generalize Tractability for Multivariate Problems Part II: Linear Tensor Prouct Problems, Linear Information, an Unrestricte Tractability Michael Gnewuch Department of Computer Science, University of Kiel,

More information

Assignment 1. g i (x 1,..., x n ) dx i = 0. i=1

Assignment 1. g i (x 1,..., x n ) dx i = 0. i=1 Assignment 1 Golstein 1.4 The equations of motion for the rolling isk are special cases of general linear ifferential equations of constraint of the form g i (x 1,..., x n x i = 0. i=1 A constraint conition

More information

Necessary and Sufficient Conditions for Sketched Subspace Clustering

Necessary and Sufficient Conditions for Sketched Subspace Clustering Necessary an Sufficient Conitions for Sketche Subspace Clustering Daniel Pimentel-Alarcón, Laura Balzano 2, Robert Nowak University of Wisconsin-Maison, 2 University of Michigan-Ann Arbor Abstract This

More information

MODELLING DEPENDENCE IN INSURANCE CLAIMS PROCESSES WITH LÉVY COPULAS ABSTRACT KEYWORDS

MODELLING DEPENDENCE IN INSURANCE CLAIMS PROCESSES WITH LÉVY COPULAS ABSTRACT KEYWORDS MODELLING DEPENDENCE IN INSURANCE CLAIMS PROCESSES WITH LÉVY COPULAS BY BENJAMIN AVANZI, LUKE C. CASSAR AND BERNARD WONG ABSTRACT In this paper we investigate the potential of Lévy copulas as a tool for

More information

Iterated Point-Line Configurations Grow Doubly-Exponentially

Iterated Point-Line Configurations Grow Doubly-Exponentially Iterate Point-Line Configurations Grow Doubly-Exponentially Joshua Cooper an Mark Walters July 9, 008 Abstract Begin with a set of four points in the real plane in general position. A to this collection

More information

TMA 4195 Matematisk modellering Exam Tuesday December 16, :00 13:00 Problems and solution with additional comments

TMA 4195 Matematisk modellering Exam Tuesday December 16, :00 13:00 Problems and solution with additional comments Problem F U L W D g m 3 2 s 2 0 0 0 0 2 kg 0 0 0 0 0 0 Table : Dimension matrix TMA 495 Matematisk moellering Exam Tuesay December 6, 2008 09:00 3:00 Problems an solution with aitional comments The necessary

More information

A Weak First Digit Law for a Class of Sequences

A Weak First Digit Law for a Class of Sequences International Mathematical Forum, Vol. 11, 2016, no. 15, 67-702 HIKARI Lt, www.m-hikari.com http://x.oi.org/10.1288/imf.2016.6562 A Weak First Digit Law for a Class of Sequences M. A. Nyblom School of

More information

Lecture 2 Lagrangian formulation of classical mechanics Mechanics

Lecture 2 Lagrangian formulation of classical mechanics Mechanics Lecture Lagrangian formulation of classical mechanics 70.00 Mechanics Principle of stationary action MATH-GA To specify a motion uniquely in classical mechanics, it suffices to give, at some time t 0,

More information

Parameter estimation: A new approach to weighting a priori information

Parameter estimation: A new approach to weighting a priori information Parameter estimation: A new approach to weighting a priori information J.L. Mea Department of Mathematics, Boise State University, Boise, ID 83725-555 E-mail: jmea@boisestate.eu Abstract. We propose a

More information

Multi-View Clustering via Canonical Correlation Analysis

Multi-View Clustering via Canonical Correlation Analysis Technical Report TTI-TR-2008-5 Multi-View Clustering via Canonical Correlation Analysis Kamalika Chauhuri UC San Diego Sham M. Kakae Toyota Technological Institute at Chicago ABSTRACT Clustering ata in

More information

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs

Lectures - Week 10 Introduction to Ordinary Differential Equations (ODES) First Order Linear ODEs Lectures - Week 10 Introuction to Orinary Differential Equations (ODES) First Orer Linear ODEs When stuying ODEs we are consiering functions of one inepenent variable, e.g., f(x), where x is the inepenent

More information

Monte Carlo Methods with Reduced Error

Monte Carlo Methods with Reduced Error Monte Carlo Methos with Reuce Error As has been shown, the probable error in Monte Carlo algorithms when no information about the smoothness of the function is use is Dξ r N = c N. It is important for

More information

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE

TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS. Yannick DEVILLE TEMPORAL AND TIME-FREQUENCY CORRELATION-BASED BLIND SOURCE SEPARATION METHODS Yannick DEVILLE Université Paul Sabatier Laboratoire Acoustique, Métrologie, Instrumentation Bât. 3RB2, 8 Route e Narbonne,

More information

Math 1271 Solutions for Fall 2005 Final Exam

Math 1271 Solutions for Fall 2005 Final Exam Math 7 Solutions for Fall 5 Final Eam ) Since the equation + y = e y cannot be rearrange algebraically in orer to write y as an eplicit function of, we must instea ifferentiate this relation implicitly

More information

1 dx. where is a large constant, i.e., 1, (7.6) and Px is of the order of unity. Indeed, if px is given by (7.5), the inequality (7.

1 dx. where is a large constant, i.e., 1, (7.6) and Px is of the order of unity. Indeed, if px is given by (7.5), the inequality (7. Lectures Nine an Ten The WKB Approximation The WKB metho is a powerful tool to obtain solutions for many physical problems It is generally applicable to problems of wave propagation in which the frequency

More information

Logarithmic spurious regressions

Logarithmic spurious regressions Logarithmic spurious regressions Robert M. e Jong Michigan State University February 5, 22 Abstract Spurious regressions, i.e. regressions in which an integrate process is regresse on another integrate

More information

Algorithms and matching lower bounds for approximately-convex optimization

Algorithms and matching lower bounds for approximately-convex optimization Algorithms an matching lower bouns for approximately-convex optimization Yuanzhi Li Department of Computer Science Princeton University Princeton, NJ, 08450 yuanzhil@cs.princeton.eu Anrej Risteski Department

More information

Monotonicity of facet numbers of random convex hulls

Monotonicity of facet numbers of random convex hulls Monotonicity of facet numbers of ranom convex hulls Gilles Bonnet, Julian Grote, Daniel Temesvari, Christoph Thäle, Nicola Turchi an Florian Wespi arxiv:173.31v1 [math.mg] 7 Mar 17 Abstract Let X 1,...,

More information

Vectors in two dimensions

Vectors in two dimensions Vectors in two imensions Until now, we have been working in one imension only The main reason for this is to become familiar with the main physical ieas like Newton s secon law, without the aitional complication

More information

Lecture XII. where Φ is called the potential function. Let us introduce spherical coordinates defined through the relations

Lecture XII. where Φ is called the potential function. Let us introduce spherical coordinates defined through the relations Lecture XII Abstract We introuce the Laplace equation in spherical coorinates an apply the metho of separation of variables to solve it. This will generate three linear orinary secon orer ifferential equations:

More information

Lower Bounds for Local Monotonicity Reconstruction from Transitive-Closure Spanners

Lower Bounds for Local Monotonicity Reconstruction from Transitive-Closure Spanners Lower Bouns for Local Monotonicity Reconstruction from Transitive-Closure Spanners Arnab Bhattacharyya Elena Grigorescu Mahav Jha Kyomin Jung Sofya Raskhonikova Davi P. Wooruff Abstract Given a irecte

More information

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs

Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Perfect Matchings in Õ(n1.5 ) Time in Regular Bipartite Graphs Ashish Goel Michael Kapralov Sanjeev Khanna Abstract We consier the well-stuie problem of fining a perfect matching in -regular bipartite

More information

On the Aloha throughput-fairness tradeoff

On the Aloha throughput-fairness tradeoff On the Aloha throughput-fairness traeoff 1 Nan Xie, Member, IEEE, an Steven Weber, Senior Member, IEEE Abstract arxiv:1605.01557v1 [cs.it] 5 May 2016 A well-known inner boun of the stability region of

More information

How to Minimize Maximum Regret in Repeated Decision-Making

How to Minimize Maximum Regret in Repeated Decision-Making How to Minimize Maximum Regret in Repeate Decision-Making Karl H. Schlag July 3 2003 Economics Department, European University Institute, Via ella Piazzuola 43, 033 Florence, Italy, Tel: 0039-0-4689, email:

More information

Approximate Constraint Satisfaction Requires Large LP Relaxations

Approximate Constraint Satisfaction Requires Large LP Relaxations Approximate Constraint Satisfaction Requires Large LP Relaxations oah Fleming April 19, 2018 Linear programming is a very powerful tool for attacking optimization problems. Techniques such as the ellipsoi

More information

On the Surprising Behavior of Distance Metrics in High Dimensional Space

On the Surprising Behavior of Distance Metrics in High Dimensional Space On the Surprising Behavior of Distance Metrics in High Dimensional Space Charu C. Aggarwal, Alexaner Hinneburg 2, an Daniel A. Keim 2 IBM T. J. Watson Research Center Yortown Heights, NY 0598, USA. charu@watson.ibm.com

More information

6 General properties of an autonomous system of two first order ODE

6 General properties of an autonomous system of two first order ODE 6 General properties of an autonomous system of two first orer ODE Here we embark on stuying the autonomous system of two first orer ifferential equations of the form ẋ 1 = f 1 (, x 2 ), ẋ 2 = f 2 (, x

More information

arxiv:hep-th/ v1 3 Feb 1993

arxiv:hep-th/ v1 3 Feb 1993 NBI-HE-9-89 PAR LPTHE 9-49 FTUAM 9-44 November 99 Matrix moel calculations beyon the spherical limit arxiv:hep-th/93004v 3 Feb 993 J. Ambjørn The Niels Bohr Institute Blegamsvej 7, DK-00 Copenhagen Ø,

More information

Permanent vs. Determinant

Permanent vs. Determinant Permanent vs. Determinant Frank Ban Introuction A major problem in theoretical computer science is the Permanent vs. Determinant problem. It asks: given an n by n matrix of ineterminates A = (a i,j ) an

More information

A Review of Multiple Try MCMC algorithms for Signal Processing

A Review of Multiple Try MCMC algorithms for Signal Processing A Review of Multiple Try MCMC algorithms for Signal Processing Luca Martino Image Processing Lab., Universitat e València (Spain) Universia Carlos III e Mari, Leganes (Spain) Abstract Many applications

More information

PETER L. BARTLETT AND MARTEN H. WEGKAMP

PETER L. BARTLETT AND MARTEN H. WEGKAMP CLASSIFICATION WITH A REJECT OPTION USING A HINGE LOSS PETER L. BARTLETT AND MARTEN H. WEGKAMP Abstract. We consier the problem of binary classification where the classifier can, for a particular cost,

More information

Calculus of Variations

Calculus of Variations 16.323 Lecture 5 Calculus of Variations Calculus of Variations Most books cover this material well, but Kirk Chapter 4 oes a particularly nice job. x(t) x* x*+ αδx (1) x*- αδx (1) αδx (1) αδx (1) t f t

More information