Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material


Notations. Let λ be the base measure over Δ_n given by the uniform random variable (say U) over Δ_n. Hence, for all measurable A ⊆ Δ_n, λ(A) = P(U ∈ A). Also, η : X → Δ_n is the mapping that gives the conditional probability vector η(x) = [P(Y = 1 | X = x), ..., P(Y = n | X = x)]^⊤ for a given instance x ∈ X. Let ν be the probability measure over the simplex induced by the random variable η(X); in particular, for all measurable A ⊆ Δ_n, ν(A) = P_{X∼µ}(η(X) ∈ A). For a matrix L ∈ [0,1]^{n×n} we let l_1, l_2, ..., l_n denote the columns of L. For any set A ⊆ R^d, the set Ā denotes its closure. For any vector v ∈ R^n, we let (v)_(i) denote the i-th element when the components of v are sorted in ascending order. For any y ∈ [n], we shall denote rand(y) = [1(y = 1), ..., 1(y = n)]^⊤.

A. Supplementary Material for Section 2 (Complex Performance Measures)

A.1. Details of Micro F_1-measure

We consider the form of the micro F_1 used in the BioNLP challenge (Kim et al., 2013), which treats class 1 as a default class (in information extraction, this class pertains to examples for which no information is required to be extracted). One can then define the micro precision of a classifier h : X → [n] with confusion matrix C = C_D[h] as the probability of an instance being correctly labelled, given that it was assigned by h a class other than 1:

micro-prec(C) = P(h(X) = Y | h(X) ≠ 1) = (Σ_{i=2}^n C_ii) / (Σ_{i=2}^n Σ_{j=1}^n C_ji) = (Σ_{i=2}^n C_ii) / (1 − Σ_{i=1}^n C_{i1}).

Similarly, the micro recall of h can be defined as the probability of an instance being correctly labelled, given that its true class was not 1:

micro-rec(C) = P(h(X) = Y | Y ≠ 1) = (Σ_{i=2}^n C_ii) / (Σ_{i=2}^n Σ_{j=1}^n C_ij) = (Σ_{i=2}^n C_ii) / (1 − Σ_{i=1}^n C_{1i}).

The micro F_1 that we analyze is the harmonic mean of the micro precision and micro recall given above:

ψ_microF1(C) = 2 micro-prec(C) micro-rec(C) / (micro-prec(C) + micro-rec(C)) = (2 Σ_{i=2}^n C_ii) / (2 − Σ_{i=1}^n C_{1i} − Σ_{i=1}^n C_{i1}).

Note that the above performance measure can be written as a ratio-of-linear function: ψ_microF1(C) = ⟨A, C⟩ / ⟨B, C⟩, where A_{11} = 0, A_{ii} = 2 ∀i ≠ 1, A_{ij} = 0 ∀i ≠ j, and B_{11} = 0, B_{1i} = B_{i1} = 1 ∀i ≠ 1, B_{ij} = 2 ∀i, j ≠ 1. Also, note that this performance measure satisfies the conditions in Theorem 17, with sup_{C∈C_D} ψ_microF1(C) ≤ 1 and min_{C∈C_D} ⟨B, C⟩ ≥ 1 − π_1 > 0, and hence Algorithm 2 is consistent for this performance measure.

Recently, Parambath et al. (2014) also considered a form of micro F_1 similar to that used in the BioNLP challenge. The expression they use is slightly simpler than ours and differs slightly from the BioNLP performance measure:

ψ̃_microF1(C) = (2 Σ_{i=2}^n C_ii) / (1 + Σ_{i=2}^n C_ii − C_{11}).

Another popular variant of the micro F_1 involves averaging the entries of the one-versus-all binary confusion matrices for all classes, and computing the F_1 for the averaged matrix; as pointed out by Manning et al. (2008), this form of micro F_1 effectively reduces to the 0-1 classification accuracy.
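As a concrete check of the formulas above, the following small Python sketch (ours, not code from the paper) evaluates the micro precision, micro recall and micro F_1 on an example confusion matrix; class 0 plays the role of the default class 1 in the text, since Python indexes from 0.

```python
import numpy as np

# Minimal check of the micro F1 formulas above (illustrative; class 0 plays the
# role of the default class "1" in the text, since Python indexes from 0).

def micro_f1(C):
    """C[i, j] = P(Y = i, h(X) = j); returns (micro-precision, micro-recall, micro-F1)."""
    C = np.asarray(C, dtype=float)
    tp = np.trace(C) - C[0, 0]                  # sum_{i >= 2} C_ii
    prec = tp / (1.0 - C[:, 0].sum())           # P(h(X) = Y | h(X) != default)
    rec = tp / (1.0 - C[0, :].sum())            # P(h(X) = Y | Y != default)
    f1 = 2 * tp / (2.0 - C[0, :].sum() - C[:, 0].sum())
    return prec, rec, f1

# Example confusion matrix (rows: true class, columns: predicted class); entries sum to 1.
C = np.array([[0.40, 0.05, 0.05],
              [0.05, 0.20, 0.05],
              [0.05, 0.05, 0.10]])
prec, rec, f1 = micro_f1(C)
assert abs(f1 - 2 * prec * rec / (prec + rec)) < 1e-12   # harmonic-mean identity
print(prec, rec, f1)
```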

B. Supplementary Material for Section 3 (Bayes Optimal Classifiers)

B.1. Example Distribution Where the Optimal Classifier Needs to be Randomized

We present an example distribution where the optimal performance for the G-mean measure can be achieved only by a randomized classifier.

Example 5 (Distribution where the optimal classifier needs to be randomized). Let D be a distribution over {x} × {1, 2} with η_1(x) = η_2(x) = 1/2, and suppose we are interested in finding the optimal classifier for the G-mean performance measure (see Example 3) under D. The two deterministic classifiers for this setting, namely, the one which predicts 1 on x and the other that predicts 2 on x, both yield a G-mean of 0. However, the randomized classifier h*(x) = [1/2, 1/2]^⊤ has a G-mean value of 1/4 > 0 and can be verified to be the unique optimal classifier for the G-mean under D.

We next present the proofs for the theorems/lemmas/propositions in Section 3.

B.2. Proof of Theorem 11

Theorem 11 (Form of Bayes optimal classifier for ratio-of-linear ψ). Let ψ : [0,1]^{n×n} → R_+ be a ratio-of-linear performance measure of the form ψ(C) = ⟨A, C⟩/⟨B, C⟩ for some A, B ∈ R^{n×n} with ⟨B, C⟩ > 0 ∀C ∈ C_D. Let t*_D = P^{ψ,*}_D. Let L̄ = −(A − t*_D B), and let L ∈ [0,1]^{n×n} be obtained by scaling and shifting L̄ so its entries lie in [0,1]. Then any classifier that is ψ^L-optimal is also ψ-optimal.

In the following, we omit the subscript on t*_D for ease of presentation. We first state the following lemma, using which we prove the above theorem.

Lemma 18. Let ψ : [0,1]^{n×n} → R_+ be such that ψ(C) = ⟨A, C⟩/⟨B, C⟩, for some matrices A, B ∈ R^{n×n} with ⟨B, C⟩ > 0 for all C ∈ C_D. Let t* = sup_{C∈C̄_D} ψ(C). Then sup_{C∈C̄_D} ⟨A − t*B, C⟩ = 0.

Proof. Define φ : R → R as φ(t) = sup_{C∈C̄_D} ⟨A − tB, C⟩. It is easy to see that φ (being a point-wise supremum of linear functions) is convex, and hence a continuous function over R. By definition of t*, we have for all C ∈ C̄_D that ⟨A, C⟩/⟨B, C⟩ ≤ t*, or equivalently ⟨A − t*B, C⟩ ≤ 0. Thus

φ(t*) = sup_{C∈C̄_D} ⟨A − t*B, C⟩ ≤ 0.    (2)

Also, for any t < t*, there exists C ∈ C̄_D such that ⟨A, C⟩/⟨B, C⟩ > t, or equivalently ⟨A − tB, C⟩ > 0. Thus for all t < t*,

φ(t) = sup_{C∈C̄_D} ⟨A − tB, C⟩ > 0.

Next, by continuity of φ, for any monotonically increasing sequence of real numbers {t_i}_i converging to t*, we have that φ(t_i) converges to φ(t*); since φ(t_i) > 0 for each t_i in this sequence, at the limit t* we have that φ(t*) ≥ 0. Along with Eq. (2), this gives us sup_{C∈C̄_D} ⟨A − t*B, C⟩ = φ(t*) = 0.

We next give the proof of Theorem 11.

Proof of Theorem 11. Let h* : X → Δ_n be a ψ^L-optimal classifier. We shall show that h* is also ψ-optimal, which will also imply existence of a ψ-optimal classifier. We have

⟨L, C_D[h*]⟩ = inf_{h:X→Δ_n} ⟨L, C_D[h]⟩ = inf_{C∈C̄_D} ⟨L, C⟩.

Since L is a scaled and translated version of L̄ = −(A − t*B) (where t* = P^{ψ,*}_D), we further have

⟨A − t*B, C_D[h*]⟩ = sup_{C∈C̄_D} ⟨A − t*B, C⟩.

Now, from Lemma 18 we know that ⟨A − t*B, C_D[h*]⟩ = 0. Hence ⟨A, C_D[h*]⟩ = t* ⟨B, C_D[h*]⟩, or equivalently,

ψ(C_D[h*]) = ⟨A, C_D[h*]⟩ / ⟨B, C_D[h*]⟩ = t* = sup_{C∈C̄_D} ψ(C).

Thus h* is also ψ-optimal, which completes the proof.

B.3. Proof of Proposition 10

Proposition 10. C_D is a convex set.

Proof. Let C_1, C_2 ∈ C_D and let λ ∈ [0,1]. We will show that λC_1 + (1−λ)C_2 ∈ C_D. By definition of C_D, there exist randomized classifiers h_1, h_2 : X → Δ_n such that C_1 = C_D[h_1] and C_2 = C_D[h_2]. Consider the randomized classifier h_λ : X → Δ_n defined as h_λ(x) = λh_1(x) + (1−λ)h_2(x). It can be seen that C_D[h_λ] = λC_1 + (1−λ)C_2.

B.4. Supporting Technical Lemmas for Lemma 12 and Theorem 13

In this subsection we give some supporting technical lemmas which will be useful in the proofs of Lemma 12 and Theorem 13.

Lemma 19 (Confusion matrix as an integration). Let f : Δ_n → Δ_n. Then

C_D[f ∘ η] = ∫_{p∈Δ_n} p (f(p))^⊤ dν(p).

Proof.

C^D_{i,j}[f ∘ η] = E_{(X,Y)∼D}[ f_j(η(X)) 1(Y = i) ] = E_{p∼ν} E_{(X,Y)∼D}[ f_j(p) 1(Y = i) | η(X) = p ] = E_{p∼ν}[ p_i f_j(p) ].

Proposition 20 (Sufficiency of conditional probability). Let D be a distribution over X × Y. For any randomized classifier h : X → Δ_n there exists another randomized classifier h' : X → Δ_n such that C_D[h'] = C_D[h], and h' is such that h' = f ∘ η for some f : Δ_n → Δ_n.

Proof. Let h : X → Δ_n. Define f : Δ_n → Δ_n as follows:

f(p) = E_{X∼µ}[ h(X) | η(X) = p ].

We then have for any i, j ∈ [n] that

C^D_{i,j}[h] = E_{(X,Y)∼D}[ h_j(X) 1(Y = i) ] = E_{p∼ν} E_{(X,Y)∼D}[ h_j(X) 1(Y = i) | η(X) = p ]

= E_{p∼ν}[ E_{(X,Y)∼D}[ h_j(X) | η(X) = p ] E_{(X,Y)∼D}[ 1(Y = i) | η(X) = p ] ] = E_{p∼ν}[ f_j(p) p_i ] = C^D_{i,j}[f ∘ η],

where the third equality follows because, given η(X), the random variables X and Y are independent, and the last equality follows from Lemma 19.

Lemma 21 (Continuity of the C_D mapping). Let D be a distribution over X × Y. Let f_1, f_2 : Δ_n → Δ_n. Then

‖C_D[f_1 ∘ η] − C_D[f_2 ∘ η]‖_1 ≤ ∫_{p∈Δ_n} ‖f_1(p) − f_2(p)‖_1 dν(p).

Proof. Let f_1, f_2 : Δ_n → Δ_n. Then

‖C_D[f_1 ∘ η] − C_D[f_2 ∘ η]‖_1 = ‖ ∫_{p∈Δ_n} p (f_1(p) − f_2(p))^⊤ dν(p) ‖_1 ≤ ∫_{p∈Δ_n} ‖ p (f_1(p) − f_2(p))^⊤ ‖_1 dν(p) = ∫_{p∈Δ_n} ‖p‖_1 ‖f_1(p) − f_2(p)‖_1 dν(p) = ∫_{p∈Δ_n} ‖f_1(p) − f_2(p)‖_1 dν(p).

Lemma 22 (Volume of an inverse linear map of an interval). Let d > 0 be any integer. Let V ⊆ R^d be compact and convex. Let f : R^d → R be an affine function that is non-constant over V. Let V be a vector-valued random variable taking values uniformly over V. Then there exists a constant α > 0 such that for all c ∈ R and ε ∈ R_+ we have P(f(V) ∈ [c, c + ε]) ≤ αε.

Proof. Let us assume for now that the affine hull of V is the entire space R^d. For any integer i and set A, let vol_i(A) denote the i-dimensional volume of the set A. Note that vol_i(A) is undefined if the affine-hull dimension of A is greater than i, and is equal to zero if the affine-hull dimension of A is less than i. For any r > 0 and any integer i > 0 let B_i(r) ⊆ R^i denote the set B_i(r) = {x ∈ R^i : ‖x‖_2 ≤ r}. Also let R be the smallest value such that V ⊆ B_d(R). Let the affine function f be such that for all x ∈ R^d, f(x) = g^⊤x + u. By the assumption of non-constancy of f on V we have that g ≠ 0. We now have that

P(f(V) ∈ [c, c + ε]) = vol_d({v ∈ V : c − u ≤ g^⊤v ≤ c − u + ε}) / vol_d(V) ≤ vol_d({v ∈ B_d(R) : c − u ≤ g^⊤v ≤ c − u + ε}) / vol_d(V) ≤ (ε / ‖g‖_2) vol_{d−1}(B_{d−1}(R)) / vol_d(V).

The last inequality follows from the observation that the d-volume of a strip of a d-dimensional sphere of radius R is at most the (d−1)-volume of a (d−1)-dimensional sphere of radius R times the width of the strip, and the width of the strip under consideration here is simply ε/‖g‖_2. Finally, if the affine hull of V is not the entire space R^d, one can simply consider the affine hull of V to be the entire (lower-dimensional) space, and all the above arguments hold with some affine transformations and a smaller d.
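Lemmas 19–21 view the confusion matrix of a classifier of the form f ∘ η as an integral over the simplex. The following minimal Monte Carlo sketch (our own illustration; the Dirichlet choice of ν and the argmax rule f are assumptions made only for this example) checks the identity C_D[f ∘ η]_{ij} = E_{p∼ν}[ p_i f_j(p) ].

```python
import numpy as np

# A minimal Monte Carlo check of Lemma 19 (illustrative, not from the paper):
# for a classifier of the form h = f o eta, C_D[f o eta]_{ij} = E_{p~nu}[ p_i f_j(p) ].

rng = np.random.default_rng(0)
n = 3

def f(p):
    """Any map from the simplex to the simplex; here a deterministic argmax rule."""
    e = np.zeros_like(p)
    e[np.argmax(p)] = 1.0
    return e

# Example choice of nu: eta(X) ~ Dirichlet(1, ..., 1) (an assumption for illustration).
P = rng.dirichlet(np.ones(n), size=100_000)          # samples p ~ nu
F = np.apply_along_axis(f, 1, P)                     # f(p) for each sample

C = (P[:, :, None] * F[:, None, :]).mean(axis=0)     # estimates E[p_i f_j(p)]
print(C)
print(C.sum(axis=1))     # row sums approximate the class priors E[p_i]
```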

Lemma 23 (Fraction of instances with the best and second best prediction being similar in performance is small). Let L ∈ [0,1]^{n×n} be such that no two columns are identical. Let the distribution D over X × Y be such that the measure over conditional probabilities, ν, is absolutely continuous w.r.t. the base measure λ. Let c ≥ 0. Let A_c ⊆ Δ_n be the set

A_c = {p ∈ Δ_n : (p^⊤L)_(2) − (p^⊤L)_(1) ≤ c}.

Let r : R_+ → R_+ be the function defined as r(c) = ν(A_c). Then

(a) r is a monotonically increasing function.
(b) There exists a C_1 > 0 such that r is a continuous function over [0, C_1).
(c) r(0) = 0.

Proof. Part (a): The fact that r is a monotonically increasing function is immediately obvious from the observation that A_a ⊆ A_b for any a < b.

Part (b): Let C_1 = (1/2) min{ d ∈ R_+ : l_y − l_{y'} = d e for some y, y' ∈ [n], y ≠ y' }, where e is the all-ones vector. If there exist no y, y' such that l_y − l_{y'} is a scalar multiple of e, then we simply set C_1 = ∞. Note that by our assumption of unequal columns of L, we always have C_1 > 0. For any c > 0 and y, y' ∈ [n] with y ≠ y', define the set

A^{y,y'}_c = {p ∈ Δ_n : |p^⊤l_y − p^⊤l_{y'}| ≤ c}.

For any c, ε > 0, it can be clearly seen that

ν(A_{c+ε}) − ν(A_c) = ν(A_{c+ε} \ A_c),
A_{c+ε} \ A_c ⊆ ∪_{y,y'∈[n], y≠y'} (A^{y,y'}_{c+ε} \ A^{y,y'}_c),
ν(A_{c+ε} \ A_c) ≤ Σ_{y,y'∈[n], y≠y'} ν(A^{y,y'}_{c+ε} \ A^{y,y'}_c).

Hence, our proof of continuity of r would be complete if we show that ν(A^{y,y'}_{c+ε} \ A^{y,y'}_c) goes to zero as ε goes to zero, for all y ≠ y' and c ∈ [0, C_1). Let c ∈ [0, C_1) and y, y' ∈ [n] with y ≠ y'. Then

A^{y,y'}_{c+ε} \ A^{y,y'}_c = {p ∈ Δ_n : c < |p^⊤(l_y − l_{y'})| ≤ c + ε}.

If l_y − l_{y'} = d e for some d, we have that |p^⊤(l_y − l_{y'})| = |d| with |d| ≥ 2C_1 > c by definition of C_1; hence for small enough ε the set A^{y,y'}_{c+ε} \ A^{y,y'}_c is empty. If l_y − l_{y'} is not a scalar multiple of e, then p^⊤(l_y − l_{y'}) is a non-constant affine function of p over Δ_n. From Lemma 22, λ(A^{y,y'}_{c+ε} \ A^{y,y'}_c) goes to zero as ε goes to zero, and by the absolute continuity of ν w.r.t. λ, ν(A^{y,y'}_{c+ε} \ A^{y,y'}_c) goes to zero as ε goes to zero. As the above arguments hold for any c ∈ [0, C_1) and y, y' ∈ [n] with y ≠ y', the proof of part (b) is complete.

Part (c): We have

A_0 ⊆ ∪_{y,y'∈[n], y≠y'} A^{y,y'}_0.

To show r(0) = 0, we show λ(A^{y,y'}_0) = 0 for all y ≠ y'. Let y, y' ∈ [n] with y ≠ y'; then

A^{y,y'}_0 = {p ∈ Δ_n : p^⊤(l_y − l_{y'}) = 0}.

If l_y − l_{y'} = d e for some d ≠ 0, the above set is clearly empty. If l_y − l_{y'} is not a scalar multiple of e, then p^⊤(l_y − l_{y'}) is a non-constant affine function of p over Δ_n, and hence by Lemma 22, we have that λ(A^{y,y'}_0) = 0. By the absolute continuity of ν w.r.t. λ we have that ν(A^{y,y'}_0) = 0. As the above arguments hold for any y, y' ∈ [n] with y ≠ y', the proof of part (c) is complete.

Lemma 24 (Uniqueness of the ψ^L-optimal classifier). Let the distribution D over X × Y be such that the measure over conditional probabilities, ν, is absolutely continuous w.r.t. the base measure λ. Let L ∈ R^{n×n} be such that no two columns are identical. Then all ψ^L-optimal classifiers have the same confusion matrix, i.e. the minimizer over C_D of ⟨L, C⟩ is unique.

Proof. If x ∈ X is such that argmin_{y∈[n]} η(x)^⊤ l_y is a singleton, then any ψ^L-optimal classifier h* is such that h*(x) = argmin_{y∈[n]} η(x)^⊤ l_y. We now show that the set of instances in X such that argmin_{y∈[n]} η(x)^⊤ l_y is not a singleton has measure zero. For any v ∈ R^n, let (v)_(i) be the i-th element when the components of v are arranged in ascending order. Then

µ({x ∈ X : |argmin_{y∈[n]} η(x)^⊤ l_y| > 1}) = µ({x ∈ X : (η(x)^⊤L)_(1) = (η(x)^⊤L)_(2)}) = ν({p ∈ Δ_n : (p^⊤L)_(1) = (p^⊤L)_(2)}) = ν(A_0).

Thus, by Lemma 23 (part c), the set of instances in X such that argmin_{y∈[n]} η(x)^⊤ l_y is not a singleton has measure zero. Thus any pair of ψ^L-optimal classifiers are the same µ-almost everywhere, and hence all ψ^L-optimal classifiers have the same confusion matrix.

Next we give the master lemma, which uses every result in this section and will in fact be the only tool used in the proofs of Lemma 12 and Theorem 13.

Lemma 25 (Master Lemma). Let the distribution D over X × Y be such that the measure over conditional probabilities, ν, is absolutely continuous w.r.t. the base measure λ. Let L ∈ [0,1]^{n×n} be such that no two columns are identical. Then

argmin_{C∈C_D} ⟨L, C⟩ = argmin_{C∈C̄_D} ⟨L, C⟩.

Moreover, the above set is a singleton.

Proof. The first part of the proof, where one shows that argmin_{C∈C_D} ⟨L, C⟩ is a singleton, is exactly what is given by Lemma 24. Let C* = C_D[h*] ∈ argmin_{C∈C_D} ⟨L, C⟩. The classifier h* : X → Δ_n is such that C_D[h*] = C*, and is fixed for convenience as the following classifier:

h*(x) = rand(argmin_{y∈[n]} η(x)^⊤ l_y).

Also let f* : Δ_n → Δ_n be such that h* = f* ∘ η, i.e.

f*(p) = rand(argmin_{y∈[n]} p^⊤ l_y).

In the rest of the proof we simply show that C* is the unique minimizer of ⟨L, C⟩ over C̄_D as well. To do so, we first assume that C' ∈ argmin_{C∈C̄_D} ⟨L, C⟩ and C' ≠ C*. We then go on to show a contradiction, the brief details of which are given below:

1. As C' ≠ C*, we have ‖C' − C*‖_1 = ξ > 0.
2. As C' ∈ C̄_D, there exists a sequence of classifiers h_1, h_2, ..., such that their confusion matrices converge to C'.
3. The confusion matrices of the classifiers h_1, h_2, ... are bounded away from C*, as ξ > 0.
4. Due to the continuity of the C_D mapping (Lemma 21), the classifiers h_1, h_2, ... are also bounded away from h*, i.e. they must predict differently from h* on a significant fraction of the instances.
5. Due to Lemma 23, we have that for most instances the second best prediction (in terms of the loss L) is significantly worse than the best prediction. The classifiers h_1, h_2, ... all predict differently from h* (which always predicts the best label for any given instance) on a large fraction of the instances, hence they must predict a significantly worse label for a large fraction of instances.
6. From the above reasoning, the classifiers h_1, h_2, ... all perform worse by a constant additive factor than h* on the ψ^L performance measure. But, as the confusion matrices of these classifiers converge to C', the ψ^L performance of these classifiers must approach the optimal. This provides a contradiction.

The full details of the above sketch are given below. Let ‖C' − C*‖_1 = ξ > 0. As C' ∈ C̄_D, we have that for all ε > 0 there exists C_ε ∈ C_D such that ‖C_ε − C'‖_1 ≤ ε. By the triangle inequality, this implies that

‖C_ε − C*‖_1 ≥ ξ − ε.    (3)

Let f_ε : Δ_n → Δ_n be such that C_ε = C_D[f_ε ∘ η]. Now we describe the set of conditional probabilities p ∈ Δ_n for which f_ε(p) differs significantly from f*(p). Denote this bad set as B:

B = {p ∈ Δ_n : ‖f*(p) − f_ε(p)‖_1 ≥ ξ/4}.

We now show that this set is large. Applying Eq. (3) and Lemma 21 we have

ξ − ε ≤ ‖C_D[f* ∘ η] − C_D[f_ε ∘ η]‖_1 ≤ ∫_{p∈Δ_n} ‖f*(p) − f_ε(p)‖_1 dν(p) ≤ ∫_{p∈B} 2 dν(p) + ∫_{p∉B} (ξ/4) dν(p) = 2ν(B) + (ξ/4)(1 − ν(B)) ≤ 2ν(B) + ξ/4,

which gives

ν(B) ≥ 3ξ/8 − ε/2.    (4)

Now we consider the set of conditional probabilities p ∈ Δ_n such that the second best and best predictions (in terms of L) are close in performance. We show that this set is small. For any c > 0, define A_c ⊆ Δ_n as

A_c = {p ∈ Δ_n : (p^⊤L)_(2) − (p^⊤L)_(1) ≤ c}.

From Lemma 23 we have that ν(A_c) is a continuous function of c close to 0 and ν(A_0) = 0. Let c > 0 be such that

ν(A_c) ≤ ξ/16.    (5)

From Eqs. (4) and (5), we have

ν(B \ A_c) ≥ 5ξ/16 − ε/2.

Any p ∈ B \ A_c is such that f_ε(p) differs significantly from f*(p), and the second best prediction is significantly worse than the best prediction, i.e. (p^⊤L)_(2) − (p^⊤L)_(1) > c and ‖f*(p) − f_ε(p)‖_1 ≥ ξ/4. For any p ∈ Δ_n, f*(p) ∈ Δ_n has a 1 at the index argmin_{y∈[n]} p^⊤ l_y and zero elsewhere. For any p ∈ B \ A_c, we have ‖f*(p) − f_ε(p)‖_1 ≥ ξ/4, and hence the value of f_ε(p) at the index argmin_{y∈[n]} p^⊤ l_y is at most (1 − ξ/8). In particular, we have

p^⊤ L f_ε(p) ≥ (1 − ξ/8)(p^⊤L)_(1) + (ξ/8)(p^⊤L)_(2).    (6)

By using Lemma 19 and Eq. (6) we have

⟨L, C_ε⟩ − ⟨L, C*⟩ = ∫_{p∈Δ_n} p^⊤L (f_ε(p) − f*(p)) dν(p)
= ∫_{p∈B\A_c} p^⊤L (f_ε(p) − f*(p)) dν(p) + ∫_{p∈Δ_n\(B\A_c)} p^⊤L (f_ε(p) − f*(p)) dν(p)
≥ ∫_{p∈B\A_c} p^⊤L (f_ε(p) − f*(p)) dν(p)
≥ ∫_{p∈B\A_c} ( (1 − ξ/8)(p^⊤L)_(1) + (ξ/8)(p^⊤L)_(2) − (p^⊤L)_(1) ) dν(p)
= ∫_{p∈B\A_c} (ξ/8) ( (p^⊤L)_(2) − (p^⊤L)_(1) ) dν(p)
≥ (ξc/8)(5ξ/16 − ε/2).

If ε ≤ ξ/2, we have

⟨L, C_ε⟩ − ⟨L, C*⟩ ≥ ξ²c/128.    (7)

The above holds for any ε ∈ (0, ξ/2], and both ξ and c do not depend on ε. However, we have C' ∈ argmin_{C∈C̄_D} ⟨L, C⟩ and ‖C_ε − C'‖_1 ≤ ε. Hence,

⟨L, C_ε⟩ ≤ ⟨L, C'⟩ + ⟨L, C_ε − C'⟩ ≤ ⟨L, C'⟩ + ‖L‖_∞ ‖C_ε − C'‖_1

≤ ⟨L, C'⟩ + ε = min_{C∈C̄_D} ⟨L, C⟩ + ε ≤ ⟨L, C*⟩ + ε.    (8)

It can be clearly seen that, for small enough ε, Eqs. (7) and (8) contradict each other. Thus we have that C' = C*.

B.5. Proof of Lemma 12

Lemma 12 (Existence of Bayes optimal classifier for monotonic ψ). Let D be such that the probability measure associated with the random vector η(X) = (η_1(X), ..., η_n(X)) is absolutely continuous w.r.t. the base probability measure associated with the uniform distribution over Δ_n, and let ψ be a performance measure that is differentiable and bounded over C̄_D, and is monotonically increasing in C_ii for each i and non-increasing in C_ij for all i ≠ j. Then ∃ h* : X → Δ_n s.t. P^ψ_D[h*] = P^{ψ,*}_D.

Proof. Let C* ∈ argmax_{C∈C̄_D} ψ(C). Such a C* always exists by compactness of C̄_D and continuity of ψ. We will show that this C* is also in C_D, thus proving the existence of h* : X → Δ_n such that C* = C_D[h*] and hence P^ψ_D[h*] = P^{ψ,*}_D. By first-order optimality and convexity of C̄_D, we have that for all C ∈ C̄_D,

⟨∇ψ(C*), C⟩ ≤ ⟨∇ψ(C*), C*⟩.

Let L be the scaled and shifted version of −∇ψ(C*) with entries in [0,1]; then we have that C* ∈ argmin_{C∈C̄_D} ⟨L, C⟩. Due to the monotonicity condition on ψ, the diagonal elements of its gradient ∇ψ(C*) are positive and the off-diagonal elements are non-positive, and hence no two columns of L are identical. Thus by Lemma 25, we have that C* ∈ C_D.

B.6. Proof of Theorem 13

Theorem 13 (Form of Bayes optimal classifier for monotonic ψ). Let D, ψ satisfy the conditions of Lemma 12. Let h* : X → Δ_n be a ψ-optimal classifier and let C* = C_D[h*]. Let L̄ = −∇ψ(C*), and let L ∈ [0,1]^{n×n} be obtained by scaling and shifting L̄ so its entries lie in [0,1]. Then any classifier that is ψ^L-optimal is also ψ-optimal.

Proof. Clearly ψ(C*) = max_{C∈C̄_D} ψ(C). Hence by the differentiability of ψ, first-order conditions for optimality and convexity of C̄_D, we have

∀C ∈ C̄_D, ⟨∇ψ(C*), C⟩ ≤ ⟨∇ψ(C*), C*⟩.

By definition of L, this implies that

∀C ∈ C̄_D, ⟨L, C⟩ ≥ ⟨L, C*⟩.

Thus, we have that h* is a ψ^L-optimal classifier. Due to the monotonicity condition on ψ, the diagonal elements of its gradient ∇ψ(C*) are positive and the off-diagonal elements are non-positive, and hence no two columns of L are identical. By Lemma 24 (or Lemma 25), we have that all ψ^L-optimal classifiers have the same confusion matrix, which is equal to C_D[h*] = C*. Thus all ψ^L-optimal classifiers are also ψ-optimal.
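Theorems 11 and 13 both reduce ψ-optimal prediction to cost-sensitive classification with a loss matrix obtained by scaling and shifting −(A − t*B) or −∇ψ(C*) into [0,1]^{n×n}. The sketch below (ours, not from the paper; the gradient value and the class-probability vector are hypothetical placeholders) illustrates this reduction.

```python
import numpy as np

# Illustrative sketch of the reduction used in Theorems 11 and 13 (not the paper's code):
# given the gradient of psi at an (estimated) optimal confusion matrix, build the loss
# matrix L by scaling/shifting -grad into [0,1]^{n x n}, then predict with the
# cost-sensitive rule  h(x) = argmin_y  eta(x)^T l_y.

def loss_matrix_from_gradient(grad_psi):
    """Scale and shift -grad_psi(C*) so that its entries lie in [0, 1]."""
    L = -np.asarray(grad_psi, dtype=float)
    L -= L.min()
    if L.max() > 0:
        L /= L.max()
    return L

def plug_in_predict(eta_x, L):
    """Cost-sensitive prediction: argmin_y sum_i eta_i(x) * L[i, y]."""
    return int(np.argmin(eta_x @ L))

# Example with a hypothetical gradient (assumed, for illustration only):
grad = np.array([[ 2.0, -0.5, -0.5],
                 [-0.5,  1.0, -0.5],
                 [-0.5, -0.5,  1.5]])
L = loss_matrix_from_gradient(grad)
print(plug_in_predict(np.array([0.2, 0.5, 0.3]), L))
```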

C. Supplementary Material for Section 5 (Consistency)

C.1. Proof of Lemma 14

Lemma 14 (L-regret of multiclass plug-in classifiers). For a fixed L ∈ [0,1]^{n×n} and class probability estimation model η̂ : X → Δ_n, let ĥ : X → [n] be a classifier with ĥ(x) ∈ argmin_{j∈[n]} Σ_{i=1}^n η̂_i(x) L_{ij}. Then

P^{L,*}_D − P^L_D[ĥ] ≤ E_X[ ‖η̂(X) − η(X)‖_1 ].

Proof. Let h* : X → Δ_n be such that

h*(x) = rand(argmin_{y∈[n]} l_y^⊤ η(x)).

By Proposition 6, we have that h* ∈ argmax_{h:X→Δ_n} P^L_D[h]. We then have

P^{L,*}_D − P^L_D[ĥ] = ⟨L, C_D[ĥ]⟩ − ⟨L, C_D[h*]⟩
= E_X[ η(X)^⊤ l_{ĥ(X)} ] − E_X[ η(X)^⊤ l_{h*(X)} ]
= E_X[ η̂(X)^⊤ l_{ĥ(X)} ] + E_X[ (η(X) − η̂(X))^⊤ l_{ĥ(X)} ] − E_X[ η(X)^⊤ l_{h*(X)} ]
≤ E_X[ η̂(X)^⊤ l_{h*(X)} ] + E_X[ (η(X) − η̂(X))^⊤ l_{ĥ(X)} ] − E_X[ η(X)^⊤ l_{h*(X)} ]
= E_X[ (η(X) − η̂(X))^⊤ (l_{ĥ(X)} − l_{h*(X)}) ]
≤ E_X[ ‖η(X) − η̂(X)‖_1 ‖l_{ĥ(X)} − l_{h*(X)}‖_∞ ]
≤ E_X[ ‖η̂(X) − η(X)‖_1 ],

as desired.

C.2. Proof of Lemma 15

Lemma 15 (Uniform convergence of confusion matrices). For q : X → Δ_n, let H_q = {h : X → [n] : h(x) ∈ argmin_{j∈[n]} Σ_{i=1}^n q_i(x) L_{ij} for some L ∈ [0,1]^{n×n}}. Let S ∈ (X × [n])^m be a sample drawn i.i.d. from D^m. For any δ ∈ (0,1], w.p. at least 1 − δ (over the draw of S from D^m),

sup_{h∈H_q} ‖C_D[h] − Ĉ_S[h]‖_∞ ≤ C √( (n² log(n) log(m) + log(n²/δ)) / m ),

where C > 0 is a distribution-independent constant.

Proof. For any a, b ∈ [n] we have

sup_{h∈H_q} | Ĉ^S_{a,b}[h] − C^D_{a,b}[h] | = sup_{h∈H_q} | (1/m) Σ_{i=1}^m 1(y_i = a, h(x_i) = b) − E[1(Y = a, h(X) = b)] |
≤ sup_{h̄∈H^b_q} | (1/m) Σ_{i=1}^m 1(y_i = a) h̄(x_i) − E[1(Y = a) h̄(X)] |,

where for a fixed b ∈ [n], H^b_q = { h̄ : X → {0,1} : ∃ L ∈ [0,1]^{n×n} such that ∀x ∈ X, h̄(x) = 1(b ∈ argmin_{y∈[n]} l_y^⊤ q(x)) }. The set H^b_q can be seen as a hypothesis class whose concepts are the intersection of n halfspaces in R^n (corresponding to q(x))

through the origin. Hence we have from a lemma of Blumer et al. (1989) that the VC-dimension of H^b_q is at most 2n² log(3n). From standard uniform convergence arguments we have that, for each a, b ∈ [n], the following holds with probability at least 1 − δ:

sup_{h∈H_q} | Ĉ^S_{a,b}[h] − C^D_{a,b}[h] | ≤ C √( (n² log(n) log(m) + log(1/δ)) / m ),

where C > 0 is some constant. Applying a union bound over all a, b ∈ [n], we have that the following holds with probability at least 1 − δ:

sup_{h∈H_q} ‖Ĉ_S[h] − C_D[h]‖_∞ ≤ C √( (n² log(n) log(m) + log(n²/δ)) / m ).

C.3. Proof of Theorem 16

Theorem 16 (ψ-regret of the Frank-Wolfe method based algorithm). Let ψ : [0,1]^{n×n} → R_+ be concave over C_D, and L-Lipschitz and β-smooth w.r.t. the ℓ_1 norm. Let S = (S_1, S_2) ∈ (X × [n])^m be a training sample drawn i.i.d. from D^m. Further, let η̂ : X → Δ_n be the CPE model learned from S_1 in Algorithm 1 and h^FW_S : X → Δ_n be the classifier obtained after κ iterations. Then for any δ ∈ (0,1], with probability at least 1 − δ (over the draw of S from D^m),

P^{ψ,*}_D − P^ψ_D[h^FW_S] ≤ 4L E_X[‖η̂(X) − η(X)‖_1] + 4√2 β n² C √( (n² log(n) log(m) + log(n²/δ)) / m ) + 8β/(κ + 2),

where C > 0 is a distribution-independent constant.

We first prove an important lemma in which we bound the approximation error of the linear optimization oracle used in the algorithm, using Lemmas 14 and 15. This result, coupled with the standard convergence analysis for the Frank-Wolfe method (Jaggi, 2013), will then allow us to prove the above theorem.

Lemma 26. Let ψ : [0,1]^{n×n} → R_+ be concave over C_D, and L-Lipschitz and β-smooth w.r.t. the ℓ_1 norm. Let the classifiers ĝ^1, ..., ĝ^T and h^0, h^1, ..., h^T be as defined in Algorithm 1. Then for any δ ∈ (0,1], with probability at least 1 − δ (over the draw of S from D^m), we have for all 1 ≤ t ≤ T:

⟨∇ψ(C_D[h^{t−1}]), C_D[ĝ^t]⟩ ≥ max_{g:X→Δ_n} ⟨∇ψ(C_D[h^{t−1}]), C_D[g]⟩ − ε_S,

where ε_S = 2L E_X[‖η̂(X) − η(X)‖_1] + 2√2 C β n² √( (n² log(n) log(m) + log(n²/δ)) / m ).

Proof. For any 1 ≤ t ≤ T, let g^{t,*} ∈ argmax_{g:X→Δ_n} ⟨∇ψ(C_D[h^{t−1}]), C_D[g]⟩. We then have

max_{g:X→Δ_n} ⟨∇ψ(C_D[h^{t−1}]), C_D[g]⟩ − ⟨∇ψ(C_D[h^{t−1}]), C_D[ĝ^t]⟩
= ⟨∇ψ(C_D[h^{t−1}]), C_D[g^{t,*}]⟩ − ⟨∇ψ(C_D[h^{t−1}]), C_D[ĝ^t]⟩
= ⟨∇ψ(C_D[h^{t−1}]), C_D[g^{t,*}]⟩ − ⟨∇ψ(Ĉ_{S_2}[h^{t−1}]), C_D[g^{t,*}]⟩   (term 1)
+ ⟨∇ψ(Ĉ_{S_2}[h^{t−1}]), C_D[g^{t,*}]⟩ − ⟨∇ψ(Ĉ_{S_2}[h^{t−1}]), C_D[ĝ^t]⟩   (term 2)
+ ⟨∇ψ(Ĉ_{S_2}[h^{t−1}]), C_D[ĝ^t]⟩ − ⟨∇ψ(C_D[h^{t−1}]), C_D[ĝ^t]⟩.   (term 3)

We next bound each of these terms. We start with term 2. For any 1 ≤ t ≤ T, let L^t be as defined in Algorithm 1. Since L^t is a scaled and translated version of the gradient ∇ψ(Ĉ_{S_2}[h^{t−1}]), we have ∇ψ(Ĉ_{S_2}[h^{t−1}]) = c^t − a^t L^t (with c^t understood entrywise), for some constant c^t ∈ R and a^t ∈ [0, 2L]. Thus for all 1 ≤ t ≤ T,

⟨∇ψ(Ĉ_{S_2}[h^{t−1}]), C_D[g^{t,*}]⟩ − ⟨∇ψ(Ĉ_{S_2}[h^{t−1}]), C_D[ĝ^t]⟩
= a^t ( ⟨L^t, C_D[ĝ^t]⟩ − ⟨L^t, C_D[g^{t,*}]⟩ )
= a^t ( P^{L^t}_D[g^{t,*}] − P^{L^t}_D[ĝ^t] )
≤ a^t ( P^{L^t,*}_D − P^{L^t}_D[ĝ^t] )
≤ 2L ( P^{L^t,*}_D − P^{L^t}_D[ĝ^t] )
≤ 2L E_X[‖η̂(X) − η(X)‖_1],

where the third step uses the definition of P^{L^t,*}_D and the last step follows from Lemma 14.

Next, for term 1, we have by an application of Hölder's inequality

⟨∇ψ(C_D[h^{t−1}]), C_D[g^{t,*}]⟩ − ⟨∇ψ(Ĉ_{S_2}[h^{t−1}]), C_D[g^{t,*}]⟩
≤ ‖∇ψ(Ĉ_{S_2}[h^{t−1}]) − ∇ψ(C_D[h^{t−1}])‖_∞ ‖C_D[g^{t,*}]‖_1
= ‖∇ψ(Ĉ_{S_2}[h^{t−1}]) − ∇ψ(C_D[h^{t−1}])‖_∞
≤ β ‖Ĉ_{S_2}[h^{t−1}] − C_D[h^{t−1}]‖_1
≤ β n² ‖Ĉ_{S_2}[h^{t−1}] − C_D[h^{t−1}]‖_∞
≤ β n² max_{i∈[t−1]} ‖Ĉ_{S_2}[ĝ^i] − C_D[ĝ^i]‖_∞
≤ β n² sup_{h∈H_η̂} ‖Ĉ_{S_2}[h] − C_D[h]‖_∞,

where the third step follows from the β-smoothness of ψ. One can similarly bound term 3. We thus have for all 1 ≤ t ≤ T,

max_{g:X→Δ_n} ⟨∇ψ(C_D[h^{t−1}]), C_D[g]⟩ − ⟨∇ψ(C_D[h^{t−1}]), C_D[ĝ^t]⟩ ≤ 2L E_X[‖η̂(X) − η(X)‖_1] + 2β n² sup_{h∈H_η̂} ‖Ĉ_{S_2}[h] − C_D[h]‖_∞.

Applying Lemma 15 to the |S_2| = m/2 examples in S_2, we have with probability at least 1 − δ (over the random draw of S_2 from D^{m/2}), for all 1 ≤ t ≤ T,

max_{g:X→Δ_n} ⟨∇ψ(C_D[h^{t−1}]), C_D[g]⟩ − ⟨∇ψ(C_D[h^{t−1}]), C_D[ĝ^t]⟩ ≤ 2L E_X[‖η̂(X) − η(X)‖_1] + 2√2 C β n² √( (n² log(n) log(m) + log(n²/δ)) / m ).

We are now ready to prove Theorem 16.

Proof of Theorem 16. Our proof shall make use of Lemma 26 and the standard convergence result for the Frank-Wolfe algorithm for maximizing a concave function over a convex set (Jaggi, 2013). We will find it useful to first define the following quantity, referred to as the curvature constant in (Jaggi, 2013):

C_ψ = sup_{C_1, C_2 ∈ C_D, γ ∈ [0,1]} (2/γ²) ( ψ(C_1) + γ⟨C_2 − C_1, ∇ψ(C_1)⟩ − ψ(C_1 + γ(C_2 − C_1)) ).

Also, define two positive scalars ε_S and δ_apx required in the analysis of (Jaggi, 2013):

ε_S = 2L E_X[‖η̂(X) − η(X)‖_1] + 2√2 C β n² √( (n² log(n) log(m) + log(n²/δ)) / m ),

δ_apx = (T + 1) ε_S / C_ψ,

where δ ∈ (0,1] is as in the theorem statement. Further, let the classifiers ĝ^1, ..., ĝ^T and h^0, h^1, ..., h^T be as defined in Algorithm 1. We then have from Lemma 26 that the following holds with probability at least 1 − δ, for all 1 ≤ t ≤ T:

⟨∇ψ(C_D[h^{t−1}]), C_D[ĝ^t]⟩ ≥ max_{g:X→Δ_n} ⟨∇ψ(C_D[h^{t−1}]), C_D[g]⟩ − ε_S
= max_{C∈C_D} ⟨∇ψ(C_D[h^{t−1}]), C⟩ − ε_S
= max_{C∈C_D} ⟨∇ψ(C_D[h^{t−1}]), C⟩ − (1/2) δ_apx (2/(T+1)) C_ψ
≥ max_{C∈C_D} ⟨∇ψ(C_D[h^{t−1}]), C⟩ − (1/2) δ_apx (2/(t+1)) C_ψ.    (9)

Also observe that for the two sequences of iterates given by the confusion matrices of the above classifiers,

C_D[h^t] = (1 − 2/(t+1)) C_D[h^{t−1}] + (2/(t+1)) C_D[ĝ^t],    (10)

for all 1 ≤ t ≤ T. Based on Eq. (9) and Eq. (10), one can now apply the result of (Jaggi, 2013). In particular, the sequence of iterates C_D[h^0], C_D[h^1], ..., C_D[h^T] can be considered as the sequence of iterates arising from running the Frank-Wolfe optimization method to maximize ψ over C_D with a linear optimization oracle that is (1/2) δ_apx (2/(t+1)) C_ψ accurate at iteration t. Since ψ is a concave function over the convex constraint set C_D, one has from Theorem 1 in (Jaggi, 2013) that the following convergence guarantee holds with probability at least 1 − δ:

P^ψ_D[h^FW_S] = ψ(C_D[h^FW_S]) = ψ(C_D[h^T]) ≥ max_{C∈C_D} ψ(C) − (2C_ψ/(T+2))(1 + δ_apx) = max_{C∈C_D} ψ(C) − 2C_ψ/(T+2) − 2ε_S (T+1)/(T+2) ≥ max_{C∈C_D} ψ(C) − 2C_ψ/(T+2) − 2ε_S.    (11)

We can further upper bound C_ψ in the above inequality in terms of the smoothness parameter of ψ:

C_ψ = sup_{C_1, C_2 ∈ C_D, γ ∈ [0,1]} (2/γ²) ( ψ(C_1) + γ⟨C_2 − C_1, ∇ψ(C_1)⟩ − ψ(C_1 + γ(C_2 − C_1)) ) ≤ sup_{C_1, C_2 ∈ C_D, γ ∈ [0,1]} (2/γ²) (β/2) γ² ‖C_1 − C_2‖_1² ≤ 4β,

where the second step follows from the β-smoothness of ψ. Substituting back into Eq. (11), we finally have with probability at least 1 − δ,

P^ψ_D[h^FW_S] ≥ max_{C∈C_D} ψ(C) − 8β/(T+2) − 2ε_S = P^{ψ,*}_D − 4L E_X[‖η̂(X) − η(X)‖_1] − 4√2 C β n² √( (n² log(n) log(m) + log(n²/δ)) / m ) − 8β/(T+2),

which follows from the definition of ε_S. Setting T = κ completes the proof.
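To make the structure of the analysis concrete, here is a schematic Python sketch of the Frank-Wolfe based procedure analysed above; it is our own illustrative re-implementation under stated assumptions, not the authors' MATLAB code, and Algorithm 1's exact specification is in the main paper. It assumes a trained CPE model eta_hat, the gradient grad_psi of ψ as a function of the confusion matrix, and a held-out sample (X2, y2) playing the role of S_2; all helper names are ours.

```python
import numpy as np

# Schematic sketch of the Frank-Wolfe based procedure (illustrative, not the paper's code).

def empirical_confusion(h, X, y, n):
    """Empirical confusion matrix Chat[i, j] = fraction with Y = i and h(X) = j."""
    C = np.zeros((n, n))
    for xi, yi in zip(X, y):
        C[yi, h(xi)] += 1.0 / len(y)
    return C

def frank_wolfe_classifier(grad_psi, eta_hat, X2, y2, n, T=50):
    weights, classifiers = [], []
    h0 = lambda x: 0                                      # h^0: some fixed initial classifier
    C = empirical_confusion(h0, X2, y2, n)
    for t in range(1, T + 1):
        G = grad_psi(C)                                   # gradient at current empirical iterate
        L = (G.max() - G) / max(G.max() - G.min(), 1e-12) # scale/shift -G into [0, 1]
        g = lambda x, L=L: int(np.argmin(eta_hat(x) @ L)) # plug-in classifier for loss L^t
        gamma = 2.0 / (t + 1)
        C = (1 - gamma) * C + gamma * empirical_confusion(g, X2, y2, n)
        weights = [(1 - gamma) * w for w in weights] + [gamma]
        classifiers.append(g)
    return classifiers, weights                           # randomized classifier h^T
```

The returned pair (classifiers, weights) represents the randomized classifier h^T as a convex combination of the plug-in classifiers ĝ^1, ..., ĝ^T, matching the update in Eq. (10).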

C.4. Proof of Theorem 17

Theorem 17. Let ψ : [0,1]^{n×n} → R_+ be such that ψ(C) = ⟨A, C⟩/⟨B, C⟩, where A, B ∈ R^{n×n}_+, sup_{C∈C_D} ψ(C) ≤ 1, and min_{C∈C_D} ⟨B, C⟩ ≥ b for some b > 0. Let S = (S_1, S_2) ∈ (X × [n])^m be a training sample drawn i.i.d. from D^m. Let η̂ : X → Δ_n be the CPE model learned from S_1 in Algorithm 2 and h^BS_S : X → [n] be the classifier obtained after κ iterations. Then for any δ ∈ (0,1], with probability at least 1 − δ (over the draw of S from D^m),

P^{ψ,*}_D − P^ψ_D[h^BS_S] ≤ 2τ E_X[‖η̂(X) − η(X)‖_1] + 2√2 C τ √( (n² log(n) log(m) + log(n²/δ)) / m ) + 2^{−κ},

where τ = (1/b)(‖A‖_∞ + ‖B‖_∞) and C > 0 is a distribution-independent constant.

We will find it useful to state the following lemmas.

Lemma 27 (Invariant in Algorithm 2). Let ψ be as defined in Theorem 17. Let H_η̂ = {h : X → [n] : h(x) ∈ argmin_{j∈[n]} Σ_{i=1}^n η̂_i(x) L_{ij} for some L ∈ [0,1]^{n×n}}. Then the following invariant is true at the end of each iteration 0 ≤ t ≤ T of Algorithm 2:

α^t − τ ε̄ < ψ(C_D[h^t]) ≤ sup_{C∈C_D} ψ(C) ≤ β^t + τ ε̄,

where τ = (1/b)(‖A‖_∞ + ‖B‖_∞) and ε̄ = E_X[‖η̂(X) − η(X)‖_1] + sup_{h∈H_η̂} ‖C_D[h] − Ĉ_{S_2}[h]‖_∞.

Proof. We first have from Lemma 14 the following guarantee for the linear minimization step at each iteration t of Algorithm 2:

⟨L^t, C_D[ĝ^t]⟩ ≤ min_{C∈C_D} ⟨L^t, C⟩ + E_X[‖η̂(X) − η(X)‖_1] = min_{C∈C_D} ⟨L^t, C⟩ + ε_1 (say).    (12)

Further, let us denote ε_2 = sup_{h∈H_η̂} ‖C_D[h] − Ĉ_{S_2}[h]‖_∞. Notice that ε̄ = ε_1 + ε_2.

We shall now prove the lemma by mathematical induction on the iteration number t. For t = 0, the invariant holds trivially, as α^0 = 0 ≤ ψ(C_D[h^0]) and sup_{C∈C_D} ψ(C) ≤ 1 = β^0. Assume the invariant holds at the end of iteration t − 1 ∈ {0, ..., T − 1}; we shall prove that the invariant holds at the end of iteration t. In particular, we consider two cases at iteration t.

In the first case, ψ(Γ̂^t) ≥ γ^t, leading to the assignments α^t = γ^t, β^t = β^{t−1}, and h^t = ĝ^t. We have from the definition of ε_2:

⟨A − γ^t B, C_D[ĝ^t]⟩ ≥ ⟨A − γ^t B, Γ̂^t⟩ − ‖A − γ^t B‖_∞ ε_2
= ⟨A, Γ̂^t⟩ − γ^t ⟨B, Γ̂^t⟩ − ‖A − γ^t B‖_∞ ε_2
= ⟨B, Γ̂^t⟩ (ψ(Γ̂^t) − γ^t) − ‖A − γ^t B‖_∞ ε_2
≥ 0 − ‖A − γ^t B‖_∞ ε_2
> −‖A − γ^t B‖_∞ (ε_1 + ε_2)
≥ −(‖A‖_∞ + ‖B‖_∞)(ε_1 + ε_2),

where the third step follows from our case assumption that ψ(Γ̂^t) ≥ γ^t and ⟨B, Γ̂^t⟩ > 0, the fifth step follows from ε_1 > 0, and the last step follows from the triangle inequality and γ^t ≤ sup_{C∈C_D} ψ(C) ≤ 1. The above inequality further gives us

⟨A, C_D[ĝ^t]⟩ / ⟨B, C_D[ĝ^t]⟩ ≥ γ^t − (‖A‖_∞ + ‖B‖_∞)(ε_1 + ε_2) / ⟨B, C_D[ĝ^t]⟩ ≥ γ^t − (1/b)(‖A‖_∞ + ‖B‖_∞)(ε_1 + ε_2) = γ^t − τ ε̄ = α^t − τ ε̄,

where the second step follows from min_{C∈C_D} ⟨B, C⟩ ≥ b and the last step follows from the assignment α^t = γ^t. In other words,

ψ(C_D[h^t]) = ψ(C_D[ĝ^t]) = ⟨A, C_D[ĝ^t]⟩ / ⟨B, C_D[ĝ^t]⟩ > α^t − τ ε̄.

Moreover, by our assumption that the invariant holds at the end of iteration t − 1, we have

β^t + τ ε̄ = β^{t−1} + τ ε̄ ≥ sup_{C∈C_D} ψ(C) ≥ ψ(C_D[h^t]) > α^t − τ ε̄.

Thus under the first case, the invariant holds at the end of iteration t.

In the second case, ψ(Γ̂^t) < γ^t at iteration t, which leads to the assignments α^t = α^{t−1}, β^t = γ^t, and h^t = h^{t−1}. Since the invariant is assumed to hold at the end of iteration t − 1, we have

α^t − τ ε̄ = α^{t−1} − τ ε̄ < ψ(C_D[h^{t−1}]) = ψ(C_D[h^t]).    (13)

Next, recall that L^t ∈ [0,1]^{n×n} is a scaled and translated version of (γ^t B − A); since A, B have non-negative entries and 0 ≤ γ^t ≤ 1, there clearly exist c^t ∈ R and 0 < a^t ≤ ‖A‖_∞ + ‖B‖_∞ such that A − γ^t B = c^t − a^t L^t (with c^t understood entrywise). Then for C̄ ∈ argmax_{C∈C̄_D} ⟨A − γ^t B, C⟩, we have

⟨A − γ^t B, C̄⟩ = c^t − a^t ⟨L^t, C̄⟩
≤ c^t − a^t ⟨L^t, C_D[ĝ^t]⟩ + a^t ε_1
≤ c^t − a^t ⟨L^t, C_D[ĝ^t]⟩ + (‖A‖_∞ + ‖B‖_∞) ε_1
= ⟨A − γ^t B, C_D[ĝ^t]⟩ + (‖A‖_∞ + ‖B‖_∞) ε_1
≤ ⟨A − γ^t B, Γ̂^t⟩ + ‖A − γ^t B‖_∞ ε_2 + (‖A‖_∞ + ‖B‖_∞) ε_1
= ⟨A, Γ̂^t⟩ − γ^t ⟨B, Γ̂^t⟩ + ‖A − γ^t B‖_∞ ε_2 + (‖A‖_∞ + ‖B‖_∞) ε_1
= ⟨B, Γ̂^t⟩ (ψ(Γ̂^t) − γ^t) + ‖A − γ^t B‖_∞ ε_2 + (‖A‖_∞ + ‖B‖_∞) ε_1
≤ 0 + ‖A − γ^t B‖_∞ ε_2 + (‖A‖_∞ + ‖B‖_∞) ε_1
≤ (‖A‖_∞ + ‖B‖_∞)(ε_1 + ε_2),

where the second step follows from Eq. (12), the third step uses a^t ≤ ‖A‖_∞ + ‖B‖_∞, the fifth step follows from the definition of ε_2 and Hölder's inequality, the eighth step follows from our case assumption that ψ(Γ̂^t) < γ^t and ⟨B, Γ̂^t⟩ > 0, and the last step follows from the triangle inequality and γ^t ≤ sup_{C∈C_D} ψ(C) ≤ 1. In particular, we have for all C ∈ C_D,

⟨A − γ^t B, C⟩ ≤ (‖A‖_∞ + ‖B‖_∞)(ε_1 + ε_2),

or

⟨A, C⟩ / ⟨B, C⟩ ≤ γ^t + (‖A‖_∞ + ‖B‖_∞)(ε_1 + ε_2) / ⟨B, C⟩.

Since min_{C∈C_D} ⟨B, C⟩ ≥ b, we have from the above, for all C ∈ C_D,

ψ(C) ≤ γ^t + (1/b)(‖A‖_∞ + ‖B‖_∞)(ε_1 + ε_2) = γ^t + τ ε̄ = β^t + τ ε̄.

In other words, sup_{C∈C_D} ψ(C) ≤ β^t + τ ε̄. By combining the above with Eq. (13), we can see that the invariant holds at iteration t under this case as well. This completes the proof of the lemma.

Lemma 28 (Multiplicative progress in each iteration of Algorithm 2). Let ψ be as defined in Theorem 17. Then the following is true in each iteration 1 ≤ t ≤ T of Algorithm 2:

β^t − α^t ≤ (1/2)(β^{t−1} − α^{t−1}).

Proof. We consider two cases in each iteration of Algorithm 2. If in an iteration t ∈ {1, ..., T}, ψ(Γ̂^t) ≥ γ^t, leading to the assignment α^t = γ^t, then

β^t − α^t = β^{t−1} − γ^t = β^{t−1} − (α^{t−1} + β^{t−1})/2 = (1/2)(β^{t−1} − α^{t−1}).

On the other hand, if ψ(Γ̂^t) < γ^t, leading to the assignment β^t = γ^t, then

β^t − α^t = γ^t − α^{t−1} = (α^{t−1} + β^{t−1})/2 − α^{t−1} = (1/2)(β^{t−1} − α^{t−1}).

Thus in both cases, the statement of the lemma is seen to hold.

We now prove Theorem 17.

Proof of Theorem 17. For the classifier h^BS_S = h^T output by Algorithm 2 after T iterations, we have from Lemma 27,

P^{ψ,*}_D − P^ψ_D[h^BS_S] = sup_{C∈C_D} ψ(C) − ψ(C_D[h^T]) < (β^T + τ ε̄) − (α^T − τ ε̄) = β^T − α^T + 2τ ε̄ ≤ 2^{−T}(β^0 − α^0) + 2τ ε̄ = 2^{−T}(1 − 0) + 2τ ε̄ = 2^{−T} + 2τ ε̄,

where ε̄ is as defined in Lemma 27, and the second inequality follows from a repeated application of Lemma 28. Setting T = κ thus gives us

P^{ψ,*}_D − P^ψ_D[h^BS_S] ≤ 2τ E_X[‖η̂(X) − η(X)‖_1] + 2τ sup_{h∈H_η̂} ‖C_D[h] − Ĉ_{S_2}[h]‖_∞ + 2^{−κ}.

By an application of Lemma 15 to the second term on the right-hand side of the above inequality (noting that |S_2| = m/2), we then have for any δ > 0, with probability at least 1 − δ,

P^{ψ,*}_D − P^ψ_D[h^BS_S] ≤ 2τ E_X[‖η̂(X) − η(X)‖_1] + 2√2 C τ √( (n² log(n) log(m) + log(n²/δ)) / m ) + 2^{−κ},

for a distribution-independent constant C > 0.
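Analogously, the following schematic sketch (again our own illustration under stated assumptions, not the authors' code) shows the bisection based procedure analysed above for a ratio-of-linear ψ(C) = ⟨A, C⟩/⟨B, C⟩, maintaining the interval [α, β] whose behaviour is captured by Lemmas 27 and 28. It assumes a trained CPE model eta_hat and a held-out sample (X2, y2) standing in for S_2; helper names are ours.

```python
import numpy as np

# Schematic sketch of the bisection based procedure (illustrative, not the paper's code).

def empirical_confusion(h, X, y, n):
    C = np.zeros((n, n))
    for xi, yi in zip(X, y):
        C[yi, h(xi)] += 1.0 / len(y)
    return C

def bisection_classifier(A, B, eta_hat, X2, y2, n, T=20):
    alpha, beta = 0.0, 1.0                      # optimal value tracked within ~[alpha, beta]
    h_best = lambda x: 0                        # h^0: some fixed initial classifier
    for _ in range(T):
        gamma = (alpha + beta) / 2.0
        M = gamma * B - A                       # loss direction; scale/shift into [0, 1]
        L = (M - M.min()) / max(M.max() - M.min(), 1e-12)
        g = lambda x, L=L: int(np.argmin(eta_hat(x) @ L))   # cost-sensitive plug-in
        Gamma = empirical_confusion(g, X2, y2, n)
        if (A * Gamma).sum() / max((B * Gamma).sum(), 1e-12) >= gamma:
            alpha, h_best = gamma, g            # psi(Gamma^t) >= gamma^t: raise alpha
        else:
            beta = gamma                        # otherwise lower beta
    return h_best
```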

C.5. Extending Algorithm 1 to Non-Smooth Performance Measures

In Section 5, we showed that Algorithm 1 is consistent for any concave smooth performance measure (see Theorem 16). We now extend this result to concave performance measures for which the associated ψ is non-smooth (but differentiable); these include the G-mean, H-mean and Q-mean performance measures. In particular, for these performance measures, we prescribe that Algorithm 1 be applied to a suitable smooth approximation of ψ; if the quality of this approximation improves with the size of the given training sample (at an appropriate rate), then the resulting algorithm can be shown to be consistent for the original performance measure.

Theorem 29. Let ψ : [0,1]^{n×n} → R_+ be such that for any ρ ∈ R_+ there exists ψ_ρ : [0,1]^{n×n} → R_+ which is concave over C_D, and L_ρ-Lipschitz and β_ρ-smooth w.r.t. the ℓ_1 norm, with

sup_{C∈C_D} |ψ(C) − ψ_ρ(C)| ≤ θ(ρ),

for some strictly increasing function θ : R_+ → R_+. Let S = (S_1, S_2) ∈ (X × [n])^m be a training sample drawn i.i.d. from D^m. Further, let η̂ : X → Δ_n be the CPE model learned from S_1 in Algorithm 1 and h^{FW,ρ}_S : X → Δ_n be the classifier obtained after κ iterations by Algorithm 1 when run for the performance measure ψ_ρ. Then for any δ ∈ (0,1], with probability at least 1 − δ (over the draw of S from D^m),

P^{ψ,*}_D − P^ψ_D[h^{FW,ρ}_S] ≤ 4L_ρ E_X[‖η̂(X) − η(X)‖_1] + 4√2 β_ρ n² C √( (n² log(n) log(m) + log(n²/δ)) / m ) + 8β_ρ/(κ + 2) + 2θ(ρ),

where C > 0 is a distribution-independent constant.

Proof. From Theorem 16 we have that

P^{ψ_ρ,*}_D − P^{ψ_ρ}_D[h^{FW,ρ}_S] ≤ 4L_ρ E_X[‖η̂(X) − η(X)‖_1] + 4√2 β_ρ n² C √( (n² log(n) log(m) + log(n²/δ)) / m ) + 8β_ρ/(κ + 2).    (14)

For simplicity assume that the ψ-optimal classifier exists; the proof can be easily extended when this is not the case. Let h* : X → Δ_n be a ψ-optimal classifier; note that this classifier need not be ψ_ρ-optimal. We then have

P^{ψ,*}_D − P^ψ_D[h^{FW,ρ}_S] = ψ(C_D[h*]) − ψ(C_D[h^{FW,ρ}_S])
≤ ψ_ρ(C_D[h*]) − ψ_ρ(C_D[h^{FW,ρ}_S]) + 2θ(ρ)
= P^{ψ_ρ}_D[h*] − P^{ψ_ρ}_D[h^{FW,ρ}_S] + 2θ(ρ)
≤ P^{ψ_ρ,*}_D − P^{ψ_ρ}_D[h^{FW,ρ}_S] + 2θ(ρ)
≤ 4L_ρ E_X[‖η̂(X) − η(X)‖_1] + 4√2 β_ρ n² C √( (n² log(n) log(m) + log(n²/δ)) / m ) + 8β_ρ/(κ + 2) + 2θ(ρ),

where the second step follows from our assumption that sup_{C∈C_D} |ψ(C) − ψ_ρ(C)| ≤ θ(ρ), the fourth step follows from the definition of P^{ψ_ρ,*}_D, and the last step uses Eq. (14). This completes the proof.

We note that for each of the G-mean, H-mean and Q-mean, one can construct a Lipschitz, smooth approximation ψ_ρ as required in the above theorem (one possible construction for the G-mean is sketched below). Now, suppose the CPE algorithm used in Algorithm 1 is such that the class probability estimation error term in the theorem, E_X[‖η̂(X) − η(X)‖_1], converges in probability to 0 as the number of training examples m → ∞. Then for each of the given performance measures, one can allow the parameter ρ (that determines the approximation quality of ψ_ρ) to go to 0 as m → ∞ (at an appropriate rate), so that the right-hand side of the bound in the theorem goes to 0 (as m → ∞), implying that Algorithm 1 is ψ-consistent. We postpone the details to a longer version of the paper.
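As an illustration of the kind of surrogate Theorem 29 requires, one possible smoothing of the G-mean is given below; this particular construction is ours and is only meant as an example of a ψ_ρ satisfying the stated conditions. Over C_D the row sums Σ_j C_{ij} = π_i are fixed by D, so each C_ii/π_i is linear in C and the surrogate is concave over C_D.

```latex
% One possible smoothing of the G-mean (an illustrative choice, not taken from the paper).
\[
  \psi_{\mathrm{GM}}(C) = \Big(\prod_{i=1}^n \frac{C_{ii}}{\pi_i}\Big)^{1/n},
  \qquad
  \psi_\rho(C) = \Big(\prod_{i=1}^n \Big(\frac{C_{ii}}{\pi_i} + \rho\Big)\Big)^{1/n} - \rho .
\]
% Over C_D, \psi_\rho is concave, and Lipschitz and smooth with constants L_\rho, \beta_\rho
% that grow as \rho \to 0, while |\psi_\rho(C) - \psi_{\mathrm{GM}}(C)| \le \rho for all
% C \in C_D, so one may take \theta(\rho) = \rho.
```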

D. Supplementary Material for Section 6 (Experiments)

D.1. Computation of Class Probability Function for the Distribution Considered in the Synthetic Data Experiments

We provide the calculations for the class probability function for the distribution considered in the synthetic data experiments in Section 6. We present this for a more general distribution over R^d × [n], where for each class i, the class prior probability is π_i and the class conditional distribution is a Gaussian distribution with mean µ_i ∈ R^d and the same (symmetric positive semi-definite) covariance matrix Σ ∈ R^{d×d}. We shall denote the pdf for the Gaussian corresponding to class i as

f_i(x) = (1/√((2π)^d |Σ|)) exp( −(1/2)(x − µ_i)^⊤ Σ^{−1} (x − µ_i) ).

The class probability function for this distribution is then given by

η_i(x) = P(Y = i | X = x) = π_i f_i(x) / Σ_{j=1}^n π_j f_j(x)
= exp( −(1/2)(x − µ_i)^⊤ Σ^{−1} (x − µ_i) + ln π_i ) / Σ_{j=1}^n exp( −(1/2)(x − µ_j)^⊤ Σ^{−1} (x − µ_j) + ln π_j )
= exp( µ_i^⊤ Σ^{−1} x − (1/2) µ_i^⊤ Σ^{−1} µ_i + ln π_i ) exp( −(1/2) x^⊤ Σ^{−1} x ) / Σ_{j=1}^n exp( µ_j^⊤ Σ^{−1} x − (1/2) µ_j^⊤ Σ^{−1} µ_j + ln π_j ) exp( −(1/2) x^⊤ Σ^{−1} x )
= exp( w_i^⊤ x + b_i ) / Σ_{j=1}^n exp( w_j^⊤ x + b_j ),

where w_i = Σ^{−1} µ_i and b_i = −(1/2) µ_i^⊤ Σ^{−1} µ_i + ln π_i. Clearly, the class probability function for the distribution considered can be obtained as a softmax of a linear function.

D.2. Additional Experimental Details/Results

In all our experiments, the regularization parameter for each algorithm was chosen from the range {10^{−4}, ..., 10^4} using a held-out portion of the training set.

Synthetic data experiments. Since the distribution used to generate the synthetic data and the four performance measures considered satisfy the conditions in Theorem 13, the optimal classifier for each performance measure can be obtained by computing the ψ^L-optimal classifier for some loss matrix L ∈ [0,1]^{n×n}; we have a similar characterization for the micro F_1 using Theorem 11. In our experiments, we computed the optimal performance for a given performance measure by performing a search over a large range of n×n loss matrices L, used the true conditional class probability to compute a ψ^L-optimal classifier for each such L (see Proposition 6), and chose among these classifiers the one which gave the highest performance value (on a large sample drawn from the distribution). Moreover, since the class probability function here is a softmax of linear functions, it follows that the Bayes optimal performance is also achieved by a linear classification model, and therefore learning a linear model suffices to achieve consistency; we therefore learn a linear classification model in all experiments. Also, recall that Algorithm 1 outputs a randomized classifier, while Algorithm 2 outputs a deterministic classifier. In our experimental results, the ψ-performance of a randomized classifier was evaluated using the expected (empirical) confusion matrix of the deterministic classifiers in its support.

Real data experiments. All real data sets used in our experiments are listed in Table 6.

Table 6. Data sets used in the experiments in Section 6, with the number of instances, features and classes for each: UCI — car, pageblocks, glass, satimage, covtype, yeast, abalone; IR — cora, news, rcv1.

The version of the CoRA data set used in our experiments was obtained from … . The 20 Newsgroup data was obtained from http://qwone.com/~jason/20Newsgroups/. For the RCV1 data, we used a preprocessed version obtained from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. For each of the UCI data sets used in our experiments, the training set was normalized to 0 mean and unit variance, and this transformation was applied to the test set. Tables 7–9 contain results on UCI data sets not provided in Section 6. Tables 10–12 contain training times for Algorithm 1 (applied to the G-mean, H-mean and Q-mean measures) and the baseline SVMperf and 0-1 logistic regression methods on all UCI data sets; in each case, the symbol marked against SVMperf indicates that the method did not complete after 96 hrs.

Table 7. Results for G-mean on UCI data sets (car, pageblocks, glass, satimage, covtype, yeast, abalone): Frank-Wolfe (GM), SVMperf (GM), LogReg (0-1).

Table 8. Results for H-mean on UCI data sets (car, pageblocks, glass, satimage, covtype, yeast, abalone): Frank-Wolfe (HM), SVMperf (HM), LogReg (0-1).

Table 9. Results for Q-mean on UCI data sets (car, pageblocks, glass, satimage, covtype, yeast, abalone): Frank-Wolfe (QM), SVMperf (QM), LogReg (0-1).

Table 10. Training time (in secs) for G-mean on UCI data sets (car, pageblocks, glass, satimage, covtype, yeast, abalone): Frank-Wolfe (GM), SVMperf (GM), LogReg (0-1).

Table 11. Training time (in secs) for H-mean on UCI data sets (car, pageblocks, glass, satimage, covtype, yeast, abalone): Frank-Wolfe (HM), SVMperf (HM), LogReg (0-1).

Table 12. Training time (in secs) for Q-mean on UCI data sets (car, pageblocks, glass, satimage, covtype, yeast, abalone): Frank-Wolfe (QM), SVMperf (QM), LogReg (0-1).

Implementation details. The proposed Frank-Wolfe based and bisection based algorithms were implemented in MATLAB; in order to learn a CPE model in these algorithms, we used the multiclass logistic regression solver provided at https://www.cs.ubc.ca/~schmidtm/Software/minFunc.html for the experiments on the synthetic and UCI data sets, and the liblinear logistic regression implementation provided at https://www.csie.ntu.edu.tw/~cjlin/liblinear for the experiments on the IR data. All run-time experiments were run on Intel Xeon quad-core machines (2.66 GHz, 12 MB cache) with 16 GB RAM. We implemented SVMperf using a publicly available structural SVM API. The SVMperf method, proposed originally for binary performance measures (Joachims, 2005), uses a cutting plane solver in which computing the most-violated constraint requires a search over all valid confusion matrices for the given training sample. In the case of the G-mean, H-mean and Q-mean measures, this search can be restricted to the diagonal entries of the confusion matrix, but will still require (in the worst case) time exponential in the number of classes; in the case of the micro F_1, this search is more expensive and involves searching over 3n − 3 entries of the confusion matrix. While we use an exact implementation of SVMperf for these three performance measures, for the micro F_1-measure we use a version that optimizes an approximation to the micro F_1 (in particular, the variant of micro F_1 analyzed by Parambath et al. (2014)) and requires fewer computations. The tolerance parameter for the cutting-plane method in SVMperf was set to 0.01 for all experiments except on the Pageblocks and CoRA data sets, where the tolerance was set to 0.1 to enable faster run-time.
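Finally, tying back to Section D.1: for the synthetic-data setup with Gaussian class conditionals sharing a covariance matrix, η(x) is a softmax of a linear function of x. The sketch below (ours; the priors, means and covariance are placeholder values) generates a point from such a mixture and evaluates η at it.

```python
import numpy as np

# Illustrative sketch for Section D.1: eta(x) as a softmax of a linear function.
# The priors, means and covariance below are placeholders, not the paper's settings.

rng = np.random.default_rng(0)
n, d = 3, 2
pi = np.array([0.5, 0.3, 0.2])                        # class priors (assumed)
mu = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])   # class means (assumed)
Sigma = np.eye(d)                                     # shared covariance (assumed)

Sigma_inv = np.linalg.inv(Sigma)
W = mu @ Sigma_inv                                    # rows w_i = Sigma^{-1} mu_i
b = -0.5 * np.sum(mu @ Sigma_inv * mu, axis=1) + np.log(pi)   # b_i as in D.1

def eta(x):
    """Class probability vector eta(x) = softmax(Wx + b)."""
    s = W @ x + b
    s -= s.max()                                      # numerical stability
    e = np.exp(s)
    return e / e.sum()

# Draw a point from the mixture and evaluate eta at it.
y = rng.choice(n, p=pi)
x = rng.multivariate_normal(mu[y], Sigma)
print(y, eta(x))
```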


More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

Solutions 1. Introduction to Coding Theory - Spring 2010 Solutions 1. Exercise 1.1. See Examples 1.2 and 1.11 in the course notes.

Solutions 1. Introduction to Coding Theory - Spring 2010 Solutions 1. Exercise 1.1. See Examples 1.2 and 1.11 in the course notes. Solutions 1 Exercise 1.1. See Exaples 1.2 and 1.11 in the course notes. Exercise 1.2. Observe that the Haing distance of two vectors is the iniu nuber of bit flips required to transfor one into the other.

More information

Detection and Estimation Theory

Detection and Estimation Theory ESE 54 Detection and Estiation Theory Joseph A. O Sullivan Sauel C. Sachs Professor Electronic Systes and Signals Research Laboratory Electrical and Systes Engineering Washington University 11 Urbauer

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

Shannon Sampling II. Connections to Learning Theory

Shannon Sampling II. Connections to Learning Theory Shannon Sapling II Connections to Learning heory Steve Sale oyota echnological Institute at Chicago 147 East 60th Street, Chicago, IL 60637, USA E-ail: sale@athberkeleyedu Ding-Xuan Zhou Departent of Matheatics,

More information

Learnability of Gaussians with flexible variances

Learnability of Gaussians with flexible variances Learnability of Gaussians with flexible variances Ding-Xuan Zhou City University of Hong Kong E-ail: azhou@cityu.edu.hk Supported in part by Research Grants Council of Hong Kong Start October 20, 2007

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Boosting Mehryar Mohri Courant Institute and Google Research ohri@cis.nyu.edu Weak Learning Definition: concept class C is weakly PAC-learnable if there exists a (weak)

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Optimal Jamming Over Additive Noise: Vector Source-Channel Case

Optimal Jamming Over Additive Noise: Vector Source-Channel Case Fifty-first Annual Allerton Conference Allerton House, UIUC, Illinois, USA October 2-3, 2013 Optial Jaing Over Additive Noise: Vector Source-Channel Case Erah Akyol and Kenneth Rose Abstract This paper

More information

Estimating Parameters for a Gaussian pdf

Estimating Parameters for a Gaussian pdf Pattern Recognition and achine Learning Jaes L. Crowley ENSIAG 3 IS First Seester 00/0 Lesson 5 7 Noveber 00 Contents Estiating Paraeters for a Gaussian pdf Notation... The Pattern Recognition Proble...3

More information

Lecture 21. Interior Point Methods Setup and Algorithm

Lecture 21. Interior Point Methods Setup and Algorithm Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and

More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

1 Generalization bounds based on Rademacher complexity

1 Generalization bounds based on Rademacher complexity COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #0 Scribe: Suqi Liu March 07, 08 Last tie we started proving this very general result about how quickly the epirical average converges

More information

Lecture 9 November 23, 2015

Lecture 9 November 23, 2015 CSC244: Discrepancy Theory in Coputer Science Fall 25 Aleksandar Nikolov Lecture 9 Noveber 23, 25 Scribe: Nick Spooner Properties of γ 2 Recall that γ 2 (A) is defined for A R n as follows: γ 2 (A) = in{r(u)

More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information

Introduction to Machine Learning. Recitation 11

Introduction to Machine Learning. Recitation 11 Introduction to Machine Learning Lecturer: Regev Schweiger Recitation Fall Seester Scribe: Regev Schweiger. Kernel Ridge Regression We now take on the task of kernel-izing ridge regression. Let x,...,

More information

Exact tensor completion with sum-of-squares

Exact tensor completion with sum-of-squares Proceedings of Machine Learning Research vol 65:1 54, 2017 30th Annual Conference on Learning Theory Exact tensor copletion with su-of-squares Aaron Potechin Institute for Advanced Study, Princeton David

More information

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,

More information

Tail estimates for norms of sums of log-concave random vectors

Tail estimates for norms of sums of log-concave random vectors Tail estiates for nors of sus of log-concave rando vectors Rados law Adaczak Rafa l Lata la Alexander E. Litvak Alain Pajor Nicole Toczak-Jaegerann Abstract We establish new tail estiates for order statistics

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

Learnability and Stability in the General Learning Setting

Learnability and Stability in the General Learning Setting Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu

More information

arxiv: v1 [cs.lg] 8 Jan 2019

arxiv: v1 [cs.lg] 8 Jan 2019 Data Masking with Privacy Guarantees Anh T. Pha Oregon State University phatheanhbka@gail.co Shalini Ghosh Sasung Research shalini.ghosh@gail.co Vinod Yegneswaran SRI international vinod@csl.sri.co arxiv:90.085v

More information

Machine Learning Basics: Estimators, Bias and Variance

Machine Learning Basics: Estimators, Bias and Variance Machine Learning Basics: Estiators, Bias and Variance Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Basics

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

Convex Programming for Scheduling Unrelated Parallel Machines

Convex Programming for Scheduling Unrelated Parallel Machines Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly

More information

Asynchronous Gossip Algorithms for Stochastic Optimization

Asynchronous Gossip Algorithms for Stochastic Optimization Asynchronous Gossip Algoriths for Stochastic Optiization S. Sundhar Ra ECE Dept. University of Illinois Urbana, IL 680 ssrini@illinois.edu A. Nedić IESE Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs On the Inapproxiability of Vertex Cover on k-partite k-unifor Hypergraphs Venkatesan Guruswai and Rishi Saket Coputer Science Departent Carnegie Mellon University Pittsburgh, PA 1513. Abstract. Coputing

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

A1. Find all ordered pairs (a, b) of positive integers for which 1 a + 1 b = 3

A1. Find all ordered pairs (a, b) of positive integers for which 1 a + 1 b = 3 A. Find all ordered pairs a, b) of positive integers for which a + b = 3 08. Answer. The six ordered pairs are 009, 08), 08, 009), 009 337, 674) = 35043, 674), 009 346, 673) = 3584, 673), 674, 009 337)

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial

More information

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t. CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information