Consistent Multiclass Algorithms for Complex Performance Measures. Supplementary Material


Notations. Let λ be the base measure over Δ_n given by the uniform random variable (say U) over Δ_n. Hence, for all measurable A ⊆ Δ_n, λ(A) = P(U ∈ A). Also, η : X → Δ_n is the mapping that gives the conditional probability vector η(x) = [P(Y = 1 | X = x), ..., P(Y = n | X = x)]^⊤ for a given instance x ∈ X. Let ν be the probability measure over the simplex induced by the random variable η(X); in particular, for all measurable A ⊆ Δ_n, ν(A) = P_{X∼µ}(η(X) ∈ A). For a matrix L ∈ [0,1]^{n×n} we let l_1, l_2, ..., l_n denote the columns of L. For any set A ⊆ R^d, the set Ā denotes its closure. For any vector v ∈ R^n, we let (v)_(i) denote the i-th element when the components of v are sorted in ascending order. For any y ∈ [n], we shall denote rand(y) = [1(y = 1), ..., 1(y = n)]^⊤.

A. Supplementary Material for Section 2 (Complex Performance Measures)

A.1. Details of Micro F_1-measure

We consider the form of the micro F_1 used in the BioNLP challenge (Kim et al., 2013), which treats class 1 as a default class (in information extraction, this class pertains to examples for which no information is required to be extracted). One can then define the micro precision of a classifier h : X → [n] with confusion matrix C = C_D[h] as the probability of an instance being correctly labelled, given that it was assigned by h a class other than 1:

micro-prec(C) = P(h(X) = Y | h(X) ≠ 1) = (Σ_{i=2}^n C_ii) / (Σ_{i=2}^n Σ_{j=1}^n C_ji) = (Σ_{i=2}^n C_ii) / (1 − Σ_{i=1}^n C_{i1}).

Similarly, the micro recall of h can be defined as the probability of an instance being correctly labelled, given that its true class was not 1:

micro-rec(C) = P(h(X) = Y | Y ≠ 1) = (Σ_{i=2}^n C_ii) / (Σ_{i=2}^n Σ_{j=1}^n C_ij) = (Σ_{i=2}^n C_ii) / (1 − Σ_{i=1}^n C_{1i}).

The micro F_1 that we analyze is the harmonic mean of the micro precision and micro recall given above:

ψ_microF1(C) = 2 micro-prec(C) micro-rec(C) / (micro-prec(C) + micro-rec(C)) = (2 Σ_{i=2}^n C_ii) / (2 − Σ_{i=1}^n C_{1i} − Σ_{i=1}^n C_{i1}).

Note that the above performance measure can be written as a ratio-of-linear function: ψ_microF1(C) = ⟨A, C⟩ / ⟨B, C⟩, where A_{11} = 0, A_{ii} = 2 ∀i ≠ 1, A_{ij} = 0 ∀i ≠ j, and B_{11} = 0, B_{1i} = B_{i1} = 1 ∀i ≠ 1, B_{ij} = 2 ∀i, j ≠ 1. Also, note that this performance measure satisfies the conditions in Theorem 17, with sup_{C∈C_D} ψ_microF1(C) ≤ 1 and min_{C∈C_D} ⟨B, C⟩ ≥ 1 − π_1 > 0, and hence Algorithm 2 is consistent for this performance measure.

Recently, Parambath et al. (2014) also considered a form of micro F_1 similar to that used in the BioNLP challenge. The expression they use is slightly simpler than ours and differs slightly from the BioNLP performance measure:

ψ̃_microF1(C) = (2 Σ_{i=2}^n C_ii) / (1 + Σ_{i=2}^n C_ii − C_{11}).

Another popular variant of the micro F_1 involves averaging the entries of the one-versus-all binary confusion matrices for all classes, and computing the F_1 for the averaged matrix; as pointed out by Manning et al. (2008), this form of micro F_1 effectively reduces to the 0-1 classification accuracy.
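As a concrete check of the formulas above, the following small Python sketch (ours, not code from the paper) evaluates the micro precision, micro recall and micro F_1 on an example confusion matrix; class 0 plays the role of the default class 1 in the text, since Python indexes from 0.

```python
import numpy as np

# Minimal check of the micro F1 formulas above (illustrative; class 0 plays the
# role of the default class "1" in the text, since Python indexes from 0).

def micro_f1(C):
    """C[i, j] = P(Y = i, h(X) = j); returns (micro-precision, micro-recall, micro-F1)."""
    C = np.asarray(C, dtype=float)
    tp = np.trace(C) - C[0, 0]                  # sum_{i >= 2} C_ii
    prec = tp / (1.0 - C[:, 0].sum())           # P(h(X) = Y | h(X) != default)
    rec = tp / (1.0 - C[0, :].sum())            # P(h(X) = Y | Y != default)
    f1 = 2 * tp / (2.0 - C[0, :].sum() - C[:, 0].sum())
    return prec, rec, f1

# Example confusion matrix (rows: true class, columns: predicted class); entries sum to 1.
C = np.array([[0.40, 0.05, 0.05],
              [0.05, 0.20, 0.05],
              [0.05, 0.05, 0.10]])
prec, rec, f1 = micro_f1(C)
assert abs(f1 - 2 * prec * rec / (prec + rec)) < 1e-12   # harmonic-mean identity
print(prec, rec, f1)
```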

B. Supplementary Material for Section 3 (Bayes Optimal Classifiers)

B.1. Example Distribution Where the Optimal Classifier Needs to be Randomized

We present an example distribution where the optimal performance for the G-mean measure can be achieved only by a randomized classifier.

Example 5 (Distribution where the optimal classifier needs to be randomized). Let D be a distribution over {x} × {1, 2} with η_1(x) = η_2(x) = 1/2, and suppose we are interested in finding the optimal classifier for the G-mean performance measure (see Example 3) under D. The two deterministic classifiers for this setting, namely, the one which predicts 1 on x and the other that predicts 2 on x, both yield a G-mean of 0. However, the randomized classifier h*(x) = [1/2, 1/2]^⊤ has a G-mean value of 1/4 > 0 and can be verified to be the unique optimal classifier for the G-mean under D.

We next present the proofs for the theorems/lemmas/propositions in Section 3.

B.2. Proof of Theorem 11

Theorem 11 (Form of Bayes optimal classifier for ratio-of-linear ψ). Let ψ : [0,1]^{n×n} → R_+ be a ratio-of-linear performance measure of the form ψ(C) = ⟨A, C⟩/⟨B, C⟩ for some A, B ∈ R^{n×n} with ⟨B, C⟩ > 0 ∀C ∈ C_D. Let t*_D = P^{ψ,*}_D. Let L̄ = −(A − t*_D B), and let L ∈ [0,1]^{n×n} be obtained by scaling and shifting L̄ so its entries lie in [0,1]. Then any classifier that is ψ^L-optimal is also ψ-optimal.

In the following, we omit the subscript on t*_D for ease of presentation. We first state the following lemma, using which we prove the above theorem.

Lemma 18. Let ψ : [0,1]^{n×n} → R_+ be such that ψ(C) = ⟨A, C⟩/⟨B, C⟩, for some matrices A, B ∈ R^{n×n} with ⟨B, C⟩ > 0 for all C ∈ C_D. Let t* = sup_{C∈C̄_D} ψ(C). Then sup_{C∈C̄_D} ⟨A − t*B, C⟩ = 0.

Proof. Define φ : R → R as φ(t) = sup_{C∈C̄_D} ⟨A − tB, C⟩. It is easy to see that φ (being a point-wise supremum of linear functions) is convex, and hence a continuous function over R. By definition of t*, we have for all C ∈ C̄_D that ⟨A, C⟩/⟨B, C⟩ ≤ t*, or equivalently ⟨A − t*B, C⟩ ≤ 0. Thus

φ(t*) = sup_{C∈C̄_D} ⟨A − t*B, C⟩ ≤ 0.    (2)

Also, for any t < t*, there exists C ∈ C̄_D such that ⟨A, C⟩/⟨B, C⟩ > t, or equivalently ⟨A − tB, C⟩ > 0. Thus for all t < t*,

φ(t) = sup_{C∈C̄_D} ⟨A − tB, C⟩ > 0.

Next, by continuity of φ, for any monotonically increasing sequence of real numbers {t_i}_i converging to t*, we have that φ(t_i) converges to φ(t*); since φ(t_i) > 0 for each t_i in this sequence, at the limit t* we have that φ(t*) ≥ 0. Along with Eq. (2), this gives us sup_{C∈C̄_D} ⟨A − t*B, C⟩ = φ(t*) = 0.

We next give the proof of Theorem 11.

Proof of Theorem 11. Let h* : X → Δ_n be a ψ^L-optimal classifier. We shall show that h* is also ψ-optimal, which will also imply existence of a ψ-optimal classifier. We have

⟨L, C_D[h*]⟩ = inf_{h:X→Δ_n} ⟨L, C_D[h]⟩ = inf_{C∈C̄_D} ⟨L, C⟩.

Since L is a scaled and translated version of L̄ = −(A − t*B) (where t* = P^{ψ,*}_D), we further have

⟨A − t*B, C_D[h*]⟩ = sup_{C∈C̄_D} ⟨A − t*B, C⟩.

Now, from Lemma 18 we know that ⟨A − t*B, C_D[h*]⟩ = 0. Hence ⟨A, C_D[h*]⟩ = t* ⟨B, C_D[h*]⟩, or equivalently,

ψ(C_D[h*]) = ⟨A, C_D[h*]⟩ / ⟨B, C_D[h*]⟩ = t* = sup_{C∈C̄_D} ψ(C).

Thus h* is also ψ-optimal, which completes the proof.

B.3. Proof of Proposition 10

Proposition 10. C_D is a convex set.

Proof. Let C_1, C_2 ∈ C_D and let λ ∈ [0,1]. We will show that λC_1 + (1−λ)C_2 ∈ C_D. By definition of C_D, there exist randomized classifiers h_1, h_2 : X → Δ_n such that C_1 = C_D[h_1] and C_2 = C_D[h_2]. Consider the randomized classifier h_λ : X → Δ_n defined as h_λ(x) = λh_1(x) + (1−λ)h_2(x). It can be seen that C_D[h_λ] = λC_1 + (1−λ)C_2.

B.4. Supporting Technical Lemmas for Lemma 12 and Theorem 13

In this subsection we give some supporting technical lemmas which will be useful in the proofs of Lemma 12 and Theorem 13.

Lemma 19 (Confusion matrix as an integration). Let f : Δ_n → Δ_n. Then

C_D[f ∘ η] = ∫_{p∈Δ_n} p (f(p))^⊤ dν(p).

Proof.

C^D_{i,j}[f ∘ η] = E_{(X,Y)∼D}[ f_j(η(X)) 1(Y = i) ] = E_{p∼ν} E_{(X,Y)∼D}[ f_j(p) 1(Y = i) | η(X) = p ] = E_{p∼ν}[ p_i f_j(p) ].

Proposition 20 (Sufficiency of conditional probability). Let D be a distribution over X × Y. For any randomized classifier h : X → Δ_n there exists another randomized classifier h' : X → Δ_n such that C_D[h'] = C_D[h], and h' is such that h' = f ∘ η for some f : Δ_n → Δ_n.

Proof. Let h : X → Δ_n. Define f : Δ_n → Δ_n as follows:

f(p) = E_{X∼µ}[ h(X) | η(X) = p ].

We then have for any i, j ∈ [n] that

C^D_{i,j}[h] = E_{(X,Y)∼D}[ h_j(X) 1(Y = i) ] = E_{p∼ν} E_{(X,Y)∼D}[ h_j(X) 1(Y = i) | η(X) = p ]

= E_{p∼ν}[ E_{(X,Y)∼D}[ h_j(X) | η(X) = p ] E_{(X,Y)∼D}[ 1(Y = i) | η(X) = p ] ] = E_{p∼ν}[ f_j(p) p_i ] = C^D_{i,j}[f ∘ η],

where the third equality follows because, given η(X), the random variables X and Y are independent, and the last equality follows from Lemma 19.

Lemma 21 (Continuity of the C_D mapping). Let D be a distribution over X × Y. Let f_1, f_2 : Δ_n → Δ_n. Then

‖C_D[f_1 ∘ η] − C_D[f_2 ∘ η]‖_1 ≤ ∫_{p∈Δ_n} ‖f_1(p) − f_2(p)‖_1 dν(p).

Proof. Let f_1, f_2 : Δ_n → Δ_n. Then

‖C_D[f_1 ∘ η] − C_D[f_2 ∘ η]‖_1 = ‖ ∫_{p∈Δ_n} p (f_1(p) − f_2(p))^⊤ dν(p) ‖_1 ≤ ∫_{p∈Δ_n} ‖ p (f_1(p) − f_2(p))^⊤ ‖_1 dν(p) = ∫_{p∈Δ_n} ‖p‖_1 ‖f_1(p) − f_2(p)‖_1 dν(p) = ∫_{p∈Δ_n} ‖f_1(p) − f_2(p)‖_1 dν(p).

Lemma 22 (Volume of an inverse linear map of an interval). Let d > 0 be any integer. Let V ⊆ R^d be compact and convex. Let f : R^d → R be an affine function that is non-constant over V. Let V be a vector-valued random variable taking values uniformly over V. Then there exists a constant α > 0 such that for all c ∈ R and ε ∈ R_+ we have P(f(V) ∈ [c, c + ε]) ≤ αε.

Proof. Let us assume for now that the affine hull of V is the entire space R^d. For any integer i and set A, let vol_i(A) denote the i-dimensional volume of the set A. Note that vol_i(A) is undefined if the affine-hull dimension of A is greater than i, and is equal to zero if the affine-hull dimension of A is less than i. For any r > 0 and any integer i > 0 let B_i(r) ⊆ R^i denote the set B_i(r) = {x ∈ R^i : ‖x‖_2 ≤ r}. Also let R be the smallest value such that V ⊆ B_d(R). Let the affine function f be such that for all x ∈ R^d, f(x) = g^⊤x + u. By the assumption of non-constancy of f on V we have that g ≠ 0. We now have that

P(f(V) ∈ [c, c + ε]) = vol_d({v ∈ V : c − u ≤ g^⊤v ≤ c − u + ε}) / vol_d(V) ≤ vol_d({v ∈ B_d(R) : c − u ≤ g^⊤v ≤ c − u + ε}) / vol_d(V) ≤ (ε / ‖g‖_2) vol_{d−1}(B_{d−1}(R)) / vol_d(V).

The last inequality follows from the observation that the d-volume of a strip of a d-dimensional sphere of radius R is at most the (d−1)-volume of a (d−1)-dimensional sphere of radius R times the width of the strip, and the width of the strip under consideration here is simply ε/‖g‖_2. Finally, if the affine hull of V is not the entire space R^d, one can simply consider the affine hull of V to be the entire (lower-dimensional) space, and all the above arguments hold with some affine transformations and a smaller d.
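Lemmas 19–21 view the confusion matrix of a classifier of the form f ∘ η as an integral over the simplex. The following minimal Monte Carlo sketch (our own illustration; the Dirichlet choice of ν and the argmax rule f are assumptions made only for this example) checks the identity C_D[f ∘ η]_{ij} = E_{p∼ν}[ p_i f_j(p) ].

```python
import numpy as np

# A minimal Monte Carlo check of Lemma 19 (illustrative, not from the paper):
# for a classifier of the form h = f o eta, C_D[f o eta]_{ij} = E_{p~nu}[ p_i f_j(p) ].

rng = np.random.default_rng(0)
n = 3

def f(p):
    """Any map from the simplex to the simplex; here a deterministic argmax rule."""
    e = np.zeros_like(p)
    e[np.argmax(p)] = 1.0
    return e

# Example choice of nu: eta(X) ~ Dirichlet(1, ..., 1) (an assumption for illustration).
P = rng.dirichlet(np.ones(n), size=100_000)          # samples p ~ nu
F = np.apply_along_axis(f, 1, P)                     # f(p) for each sample

C = (P[:, :, None] * F[:, None, :]).mean(axis=0)     # estimates E[p_i f_j(p)]
print(C)
print(C.sum(axis=1))     # row sums approximate the class priors E[p_i]
```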

Lemma 23 (Fraction of instances with the best and second best prediction being similar in performance is small). Let L ∈ [0,1]^{n×n} be such that no two columns are identical. Let the distribution D over X × Y be such that the measure over conditional probabilities, ν, is absolutely continuous w.r.t. the base measure λ. Let c ≥ 0. Let A_c ⊆ Δ_n be the set

A_c = {p ∈ Δ_n : (p^⊤L)_(2) − (p^⊤L)_(1) ≤ c}.

Let r : R_+ → R_+ be the function defined as r(c) = ν(A_c). Then

(a) r is a monotonically increasing function.
(b) There exists a C_1 > 0 such that r is a continuous function over [0, C_1).
(c) r(0) = 0.

Proof. Part (a): The fact that r is a monotonically increasing function is immediately obvious from the observation that A_a ⊆ A_b for any a < b.

Part (b): Let C_1 = (1/2) min{ d ∈ R_+ : l_y − l_{y'} = d e for some y, y' ∈ [n], y ≠ y' }, where e is the all-ones vector. If there exist no y, y' such that l_y − l_{y'} is a scalar multiple of e, then we simply set C_1 = ∞. Note that by our assumption of unequal columns of L, we always have C_1 > 0. For any c > 0 and y, y' ∈ [n] with y ≠ y', define the set

A^{y,y'}_c = {p ∈ Δ_n : |p^⊤l_y − p^⊤l_{y'}| ≤ c}.

For any c, ε > 0, it can be clearly seen that

ν(A_{c+ε}) − ν(A_c) = ν(A_{c+ε} \ A_c),
A_{c+ε} \ A_c ⊆ ∪_{y,y'∈[n], y≠y'} (A^{y,y'}_{c+ε} \ A^{y,y'}_c),
ν(A_{c+ε} \ A_c) ≤ Σ_{y,y'∈[n], y≠y'} ν(A^{y,y'}_{c+ε} \ A^{y,y'}_c).

Hence, our proof of continuity of r would be complete if we show that ν(A^{y,y'}_{c+ε} \ A^{y,y'}_c) goes to zero as ε goes to zero, for all y ≠ y' and c ∈ [0, C_1). Let c ∈ [0, C_1) and y, y' ∈ [n] with y ≠ y'. Then

A^{y,y'}_{c+ε} \ A^{y,y'}_c = {p ∈ Δ_n : c < |p^⊤(l_y − l_{y'})| ≤ c + ε}.

If l_y − l_{y'} = d e for some d, we have that |p^⊤(l_y − l_{y'})| = |d| with |d| ≥ 2C_1 > c by definition of C_1; hence for small enough ε the set A^{y,y'}_{c+ε} \ A^{y,y'}_c is empty. If l_y − l_{y'} is not a scalar multiple of e, then p^⊤(l_y − l_{y'}) is a non-constant affine function of p over Δ_n. From Lemma 22, λ(A^{y,y'}_{c+ε} \ A^{y,y'}_c) goes to zero as ε goes to zero, and by the absolute continuity of ν w.r.t. λ, ν(A^{y,y'}_{c+ε} \ A^{y,y'}_c) goes to zero as ε goes to zero. As the above arguments hold for any c ∈ [0, C_1) and y, y' ∈ [n] with y ≠ y', the proof of part (b) is complete.

Part (c): We have

A_0 ⊆ ∪_{y,y'∈[n], y≠y'} A^{y,y'}_0.

To show r(0) = 0, we show λ(A^{y,y'}_0) = 0 for all y ≠ y'. Let y, y' ∈ [n] with y ≠ y'; then

A^{y,y'}_0 = {p ∈ Δ_n : p^⊤(l_y − l_{y'}) = 0}.

If l_y − l_{y'} = d e for some d ≠ 0, the above set is clearly empty. If l_y − l_{y'} is not a scalar multiple of e, then p^⊤(l_y − l_{y'}) is a non-constant affine function of p over Δ_n, and hence by Lemma 22, we have that λ(A^{y,y'}_0) = 0. By the absolute continuity of ν w.r.t. λ we have that ν(A^{y,y'}_0) = 0. As the above arguments hold for any y, y' ∈ [n] with y ≠ y', the proof of part (c) is complete.

Lemma 24 (Uniqueness of the ψ^L-optimal classifier). Let the distribution D over X × Y be such that the measure over conditional probabilities, ν, is absolutely continuous w.r.t. the base measure λ. Let L ∈ R^{n×n} be such that no two columns are identical. Then all ψ^L-optimal classifiers have the same confusion matrix, i.e. the minimizer over C_D of ⟨L, C⟩ is unique.

Proof. If x ∈ X is such that argmin_{y∈[n]} η(x)^⊤ l_y is a singleton, then any ψ^L-optimal classifier h* is such that h*(x) = argmin_{y∈[n]} η(x)^⊤ l_y. We now show that the set of instances in X such that argmin_{y∈[n]} η(x)^⊤ l_y is not a singleton has measure zero. For any v ∈ R^n, let (v)_(i) be the i-th element when the components of v are arranged in ascending order. Then

µ({x ∈ X : |argmin_{y∈[n]} η(x)^⊤ l_y| > 1}) = µ({x ∈ X : (η(x)^⊤L)_(1) = (η(x)^⊤L)_(2)}) = ν({p ∈ Δ_n : (p^⊤L)_(1) = (p^⊤L)_(2)}) = ν(A_0).

Thus, by Lemma 23 (part c), the set of instances in X such that argmin_{y∈[n]} η(x)^⊤ l_y is not a singleton has measure zero. Thus any pair of ψ^L-optimal classifiers are the same µ-almost everywhere, and hence all ψ^L-optimal classifiers have the same confusion matrix.

Next we give the master lemma, which uses every result in this section and will in fact be the only tool used in the proofs of Lemma 12 and Theorem 13.

Lemma 25 (Master Lemma). Let the distribution D over X × Y be such that the measure over conditional probabilities, ν, is absolutely continuous w.r.t. the base measure λ. Let L ∈ [0,1]^{n×n} be such that no two columns are identical. Then

argmin_{C∈C_D} ⟨L, C⟩ = argmin_{C∈C̄_D} ⟨L, C⟩.

Moreover, the above set is a singleton.

Proof. The first part of the proof, where one shows that argmin_{C∈C_D} ⟨L, C⟩ is a singleton, is exactly what is given by Lemma 24. Let C* = C_D[h*] ∈ argmin_{C∈C_D} ⟨L, C⟩. The classifier h* : X → Δ_n is such that C_D[h*] = C*, and is fixed for convenience as the following classifier:

h*(x) = rand(argmin_{y∈[n]} η(x)^⊤ l_y).

Also let f* : Δ_n → Δ_n be such that h* = f* ∘ η, i.e.

f*(p) = rand(argmin_{y∈[n]} p^⊤ l_y).

In the rest of the proof we simply show that C* is the unique minimizer of ⟨L, C⟩ over C̄_D as well. To do so, we first assume that C' ∈ argmin_{C∈C̄_D} ⟨L, C⟩ and C' ≠ C*. We then go on to show a contradiction, the brief details of which are given below:

1. As C' ≠ C*, we have ‖C' − C*‖_1 = ξ > 0.
2. As C' ∈ C̄_D, there exists a sequence of classifiers h_1, h_2, ..., such that their confusion matrices converge to C'.
3. The confusion matrices of the classifiers h_1, h_2, ... are bounded away from C*, as ξ > 0.
4. Due to the continuity of the C_D mapping (Lemma 21), the classifiers h_1, h_2, ... are also bounded away from h*, i.e. they must predict differently from h* on a significant fraction of the instances.
5. Due to Lemma 23, we have that for most instances the second best prediction (in terms of the loss L) is significantly worse than the best prediction. The classifiers h_1, h_2, ... all predict differently from h* (which always predicts the best label for any given instance) on a large fraction of the instances, hence they must predict a significantly worse label for a large fraction of instances.
6. From the above reasoning, the classifiers h_1, h_2, ... all perform worse by a constant additive factor than h* on the ψ^L performance measure. But, as the confusion matrices of these classifiers converge to C', the ψ^L performance of these classifiers must approach the optimal. This provides a contradiction.

The full details of the above sketch are given below. Let ‖C' − C*‖_1 = ξ > 0. As C' ∈ C̄_D, we have that for all ε > 0 there exists C_ε ∈ C_D such that ‖C_ε − C'‖_1 ≤ ε. By the triangle inequality, this implies that

‖C_ε − C*‖_1 ≥ ξ − ε.    (3)

Let f_ε : Δ_n → Δ_n be such that C_ε = C_D[f_ε ∘ η]. Now we describe the set of conditional probabilities p ∈ Δ_n for which f_ε(p) differs significantly from f*(p). Denote this bad set as B:

B = {p ∈ Δ_n : ‖f*(p) − f_ε(p)‖_1 ≥ ξ/4}.

We now show that this set is large. Applying Eq. (3) and Lemma 21 we have

ξ − ε ≤ ‖C_D[f* ∘ η] − C_D[f_ε ∘ η]‖_1 ≤ ∫_{p∈Δ_n} ‖f*(p) − f_ε(p)‖_1 dν(p) ≤ ∫_{p∈B} 2 dν(p) + ∫_{p∉B} (ξ/4) dν(p) = 2ν(B) + (ξ/4)(1 − ν(B)) ≤ 2ν(B) + ξ/4,

which gives

ν(B) ≥ 3ξ/8 − ε/2.    (4)

Now we consider the set of conditional probabilities p ∈ Δ_n such that the second best and best predictions (in terms of L) are close in performance. We show that this set is small. For any c > 0, define A_c ⊆ Δ_n as

A_c = {p ∈ Δ_n : (p^⊤L)_(2) − (p^⊤L)_(1) ≤ c}.

From Lemma 23 we have that ν(A_c) is a continuous function of c close to 0 and ν(A_0) = 0. Let c > 0 be such that

ν(A_c) ≤ ξ/16.    (5)

From Eqs. (4) and (5), we have

ν(B \ A_c) ≥ 5ξ/16 − ε/2.

Any p ∈ B \ A_c is such that f_ε(p) differs significantly from f*(p), and the second best prediction is significantly worse than the best prediction, i.e. (p^⊤L)_(2) − (p^⊤L)_(1) > c and ‖f*(p) − f_ε(p)‖_1 ≥ ξ/4. For any p ∈ Δ_n, f*(p) ∈ Δ_n has a 1 at the index argmin_{y∈[n]} p^⊤ l_y and zero elsewhere. For any p ∈ B \ A_c, we have ‖f*(p) − f_ε(p)‖_1 ≥ ξ/4, and hence the value of f_ε(p) at the index argmin_{y∈[n]} p^⊤ l_y is at most (1 − ξ/8). In particular, we have

p^⊤ L f_ε(p) ≥ (1 − ξ/8)(p^⊤L)_(1) + (ξ/8)(p^⊤L)_(2).    (6)

By using Lemma 19 and Eq. (6) we have

⟨L, C_ε⟩ − ⟨L, C*⟩ = ∫_{p∈Δ_n} p^⊤L (f_ε(p) − f*(p)) dν(p)
= ∫_{p∈B\A_c} p^⊤L (f_ε(p) − f*(p)) dν(p) + ∫_{p∈Δ_n\(B\A_c)} p^⊤L (f_ε(p) − f*(p)) dν(p)
≥ ∫_{p∈B\A_c} p^⊤L (f_ε(p) − f*(p)) dν(p)
≥ ∫_{p∈B\A_c} ( (1 − ξ/8)(p^⊤L)_(1) + (ξ/8)(p^⊤L)_(2) − (p^⊤L)_(1) ) dν(p)
= ∫_{p∈B\A_c} (ξ/8) ( (p^⊤L)_(2) − (p^⊤L)_(1) ) dν(p)
≥ (ξc/8)(5ξ/16 − ε/2).

If ε ≤ ξ/2, we have

⟨L, C_ε⟩ − ⟨L, C*⟩ ≥ ξ²c/128.    (7)

The above holds for any ε ∈ (0, ξ/2], and both ξ and c do not depend on ε. However, we have C' ∈ argmin_{C∈C̄_D} ⟨L, C⟩ and ‖C_ε − C'‖_1 ≤ ε. Hence,

⟨L, C_ε⟩ ≤ ⟨L, C'⟩ + ⟨L, C_ε − C'⟩ ≤ ⟨L, C'⟩ + ‖L‖_∞ ‖C_ε − C'‖_1

≤ ⟨L, C'⟩ + ε = min_{C∈C̄_D} ⟨L, C⟩ + ε ≤ ⟨L, C*⟩ + ε.    (8)

It can be clearly seen that, for small enough ε, Eqs. (7) and (8) contradict each other. Thus we have that C' = C*.

B.5. Proof of Lemma 12

Lemma 12 (Existence of Bayes optimal classifier for monotonic ψ). Let D be such that the probability measure associated with the random vector η(X) = (η_1(X), ..., η_n(X)) is absolutely continuous w.r.t. the base probability measure associated with the uniform distribution over Δ_n, and let ψ be a performance measure that is differentiable and bounded over C̄_D, and is monotonically increasing in C_ii for each i and non-increasing in C_ij for all i ≠ j. Then ∃ h* : X → Δ_n s.t. P^ψ_D[h*] = P^{ψ,*}_D.

Proof. Let C* ∈ argmax_{C∈C̄_D} ψ(C). Such a C* always exists by compactness of C̄_D and continuity of ψ. We will show that this C* is also in C_D, thus proving the existence of h* : X → Δ_n such that C* = C_D[h*] and hence P^ψ_D[h*] = P^{ψ,*}_D. By first-order optimality and convexity of C̄_D, we have that for all C ∈ C̄_D,

⟨∇ψ(C*), C⟩ ≤ ⟨∇ψ(C*), C*⟩.

Let L be the scaled and shifted version of −∇ψ(C*) with entries in [0,1]; then we have that C* ∈ argmin_{C∈C̄_D} ⟨L, C⟩. Due to the monotonicity condition on ψ, the diagonal elements of its gradient ∇ψ(C*) are positive and the off-diagonal elements are non-positive, and hence no two columns of L are identical. Thus by Lemma 25, we have that C* ∈ C_D.

B.6. Proof of Theorem 13

Theorem 13 (Form of Bayes optimal classifier for monotonic ψ). Let D, ψ satisfy the conditions of Lemma 12. Let h* : X → Δ_n be a ψ-optimal classifier and let C* = C_D[h*]. Let L̄ = −∇ψ(C*), and let L ∈ [0,1]^{n×n} be obtained by scaling and shifting L̄ so its entries lie in [0,1]. Then any classifier that is ψ^L-optimal is also ψ-optimal.

Proof. Clearly ψ(C*) = max_{C∈C̄_D} ψ(C). Hence by the differentiability of ψ, first-order conditions for optimality and convexity of C̄_D, we have

∀C ∈ C̄_D, ⟨∇ψ(C*), C⟩ ≤ ⟨∇ψ(C*), C*⟩.

By definition of L, this implies that

∀C ∈ C̄_D, ⟨L, C⟩ ≥ ⟨L, C*⟩.

Thus, we have that h* is a ψ^L-optimal classifier. Due to the monotonicity condition on ψ, the diagonal elements of its gradient ∇ψ(C*) are positive and the off-diagonal elements are non-positive, and hence no two columns of L are identical. By Lemma 24 (or Lemma 25), we have that all ψ^L-optimal classifiers have the same confusion matrix, which is equal to C_D[h*] = C*. Thus all ψ^L-optimal classifiers are also ψ-optimal.
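Theorems 11 and 13 both reduce ψ-optimal prediction to cost-sensitive classification with a loss matrix obtained by scaling and shifting −(A − t*B) or −∇ψ(C*) into [0,1]^{n×n}. The sketch below (ours, not from the paper; the gradient value and the class-probability vector are hypothetical placeholders) illustrates this reduction.

```python
import numpy as np

# Illustrative sketch of the reduction used in Theorems 11 and 13 (not the paper's code):
# given the gradient of psi at an (estimated) optimal confusion matrix, build the loss
# matrix L by scaling/shifting -grad into [0,1]^{n x n}, then predict with the
# cost-sensitive rule  h(x) = argmin_y  eta(x)^T l_y.

def loss_matrix_from_gradient(grad_psi):
    """Scale and shift -grad_psi(C*) so that its entries lie in [0, 1]."""
    L = -np.asarray(grad_psi, dtype=float)
    L -= L.min()
    if L.max() > 0:
        L /= L.max()
    return L

def plug_in_predict(eta_x, L):
    """Cost-sensitive prediction: argmin_y sum_i eta_i(x) * L[i, y]."""
    return int(np.argmin(eta_x @ L))

# Example with a hypothetical gradient (assumed, for illustration only):
grad = np.array([[ 2.0, -0.5, -0.5],
                 [-0.5,  1.0, -0.5],
                 [-0.5, -0.5,  1.5]])
L = loss_matrix_from_gradient(grad)
print(plug_in_predict(np.array([0.2, 0.5, 0.3]), L))
```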

C. Supplementary Material for Section 5 (Consistency)

C.1. Proof of Lemma 14

Lemma 14 (L-regret of multiclass plug-in classifiers). For a fixed L ∈ [0,1]^{n×n} and class probability estimation model η̂ : X → Δ_n, let ĥ : X → [n] be a classifier with ĥ(x) ∈ argmin_{j∈[n]} Σ_{i=1}^n η̂_i(x) L_{ij}. Then

P^{L,*}_D − P^L_D[ĥ] ≤ E_X[ ‖η̂(X) − η(X)‖_1 ].

Proof. Let h* : X → Δ_n be such that

h*(x) = rand(argmin_{y∈[n]} l_y^⊤ η(x)).

By Proposition 6, we have that h* ∈ argmax_{h:X→Δ_n} P^L_D[h]. We then have

P^{L,*}_D − P^L_D[ĥ] = ⟨L, C_D[ĥ]⟩ − ⟨L, C_D[h*]⟩
= E_X[ η(X)^⊤ l_{ĥ(X)} ] − E_X[ η(X)^⊤ l_{h*(X)} ]
= E_X[ η̂(X)^⊤ l_{ĥ(X)} ] + E_X[ (η(X) − η̂(X))^⊤ l_{ĥ(X)} ] − E_X[ η(X)^⊤ l_{h*(X)} ]
≤ E_X[ η̂(X)^⊤ l_{h*(X)} ] + E_X[ (η(X) − η̂(X))^⊤ l_{ĥ(X)} ] − E_X[ η(X)^⊤ l_{h*(X)} ]
= E_X[ (η(X) − η̂(X))^⊤ (l_{ĥ(X)} − l_{h*(X)}) ]
≤ E_X[ ‖η(X) − η̂(X)‖_1 ‖l_{ĥ(X)} − l_{h*(X)}‖_∞ ]
≤ E_X[ ‖η̂(X) − η(X)‖_1 ],

as desired.

C.2. Proof of Lemma 15

Lemma 15 (Uniform convergence of confusion matrices). For q : X → Δ_n, let H_q = {h : X → [n] : h(x) ∈ argmin_{j∈[n]} Σ_{i=1}^n q_i(x) L_{ij} for some L ∈ [0,1]^{n×n}}. Let S ∈ (X × [n])^m be a sample drawn i.i.d. from D^m. For any δ ∈ (0,1], w.p. at least 1 − δ (over the draw of S from D^m),

sup_{h∈H_q} ‖C_D[h] − Ĉ_S[h]‖_∞ ≤ C √( (n² log(n) log(m) + log(n²/δ)) / m ),

where C > 0 is a distribution-independent constant.

Proof. For any a, b ∈ [n] we have

sup_{h∈H_q} | Ĉ^S_{a,b}[h] − C^D_{a,b}[h] | = sup_{h∈H_q} | (1/m) Σ_{i=1}^m 1(y_i = a, h(x_i) = b) − E[1(Y = a, h(X) = b)] |
≤ sup_{h̄∈H^b_q} | (1/m) Σ_{i=1}^m 1(y_i = a) h̄(x_i) − E[1(Y = a) h̄(X)] |,

where for a fixed b ∈ [n], H^b_q = { h̄ : X → {0,1} : ∃ L ∈ [0,1]^{n×n} such that ∀x ∈ X, h̄(x) = 1(b ∈ argmin_{y∈[n]} l_y^⊤ q(x)) }. The set H^b_q can be seen as a hypothesis class whose concepts are the intersection of n halfspaces in R^n (corresponding to q(x))

through the origin. Hence we have from a lemma of Blumer et al. (1989) that the VC-dimension of H^b_q is at most 2n² log(3n). From standard uniform convergence arguments we have that, for each a, b ∈ [n], the following holds with probability at least 1 − δ:

sup_{h∈H_q} | Ĉ^S_{a,b}[h] − C^D_{a,b}[h] | ≤ C √( (n² log(n) log(m) + log(1/δ)) / m ),

where C > 0 is some constant. Applying a union bound over all a, b ∈ [n], we have that the following holds with probability at least 1 − δ:

sup_{h∈H_q} ‖Ĉ_S[h] − C_D[h]‖_∞ ≤ C √( (n² log(n) log(m) + log(n²/δ)) / m ).

C.3. Proof of Theorem 16

Theorem 16 (ψ-regret of the Frank-Wolfe method based algorithm). Let ψ : [0,1]^{n×n} → R_+ be concave over C_D, and L-Lipschitz and β-smooth w.r.t. the ℓ_1 norm. Let S = (S_1, S_2) ∈ (X × [n])^m be a training sample drawn i.i.d. from D^m. Further, let η̂ : X → Δ_n be the CPE model learned from S_1 in Algorithm 1 and h^FW_S : X → Δ_n be the classifier obtained after κ iterations. Then for any δ ∈ (0,1], with probability at least 1 − δ (over the draw of S from D^m),

P^{ψ,*}_D − P^ψ_D[h^FW_S] ≤ 4L E_X[‖η̂(X) − η(X)‖_1] + 4√2 β n² C √( (n² log(n) log(m) + log(n²/δ)) / m ) + 8β/(κ + 2),

where C > 0 is a distribution-independent constant.

We first prove an important lemma in which we bound the approximation error of the linear optimization oracle used in the algorithm, using Lemmas 14 and 15. This result, coupled with the standard convergence analysis for the Frank-Wolfe method (Jaggi, 2013), will then allow us to prove the above theorem.

Lemma 26. Let ψ : [0,1]^{n×n} → R_+ be concave over C_D, and L-Lipschitz and β-smooth w.r.t. the ℓ_1 norm. Let the classifiers ĝ^1, ..., ĝ^T and h^0, h^1, ..., h^T be as defined in Algorithm 1. Then for any δ ∈ (0,1], with probability at least 1 − δ (over the draw of S from D^m), we have for all 1 ≤ t ≤ T:

⟨∇ψ(C_D[h^{t−1}]), C_D[ĝ^t]⟩ ≥ max_{g:X→Δ_n} ⟨∇ψ(C_D[h^{t−1}]), C_D[g]⟩ − ε_S,

where ε_S = 2L E_X[‖η̂(X) − η(X)‖_1] + 2√2 C β n² √( (n² log(n) log(m) + log(n²/δ)) / m ).

Proof. For any 1 ≤ t ≤ T, let g^{t,*} ∈ argmax_{g:X→Δ_n} ⟨∇ψ(C_D[h^{t−1}]), C_D[g]⟩. We then have

max_{g:X→Δ_n} ⟨∇ψ(C_D[h^{t−1}]), C_D[g]⟩ − ⟨∇ψ(C_D[h^{t−1}]), C_D[ĝ^t]⟩
= ⟨∇ψ(C_D[h^{t−1}]), C_D[g^{t,*}]⟩ − ⟨∇ψ(C_D[h^{t−1}]), C_D[ĝ^t]⟩
= ⟨∇ψ(C_D[h^{t−1}]), C_D[g^{t,*}]⟩ − ⟨∇ψ(Ĉ_{S_2}[h^{t−1}]), C_D[g^{t,*}]⟩   (term 1)
+ ⟨∇ψ(Ĉ_{S_2}[h^{t−1}]), C_D[g^{t,*}]⟩ − ⟨∇ψ(Ĉ_{S_2}[h^{t−1}]), C_D[ĝ^t]⟩   (term 2)
+ ⟨∇ψ(Ĉ_{S_2}[h^{t−1}]), C_D[ĝ^t]⟩ − ⟨∇ψ(C_D[h^{t−1}]), C_D[ĝ^t]⟩.   (term 3)

We next bound each of these terms. We start with term 2. For any 1 ≤ t ≤ T, let L^t be as defined in Algorithm 1. Since L^t is a scaled and translated version of the gradient ∇ψ(Ĉ_{S_2}[h^{t−1}]), we have ∇ψ(Ĉ_{S_2}[h^{t−1}]) = c^t − a^t L^t (with c^t understood entrywise), for some constant c^t ∈ R and a^t ∈ [0, 2L]. Thus for all 1 ≤ t ≤ T,

⟨∇ψ(Ĉ_{S_2}[h^{t−1}]), C_D[g^{t,*}]⟩ − ⟨∇ψ(Ĉ_{S_2}[h^{t−1}]), C_D[ĝ^t]⟩
= a^t ( ⟨L^t, C_D[ĝ^t]⟩ − ⟨L^t, C_D[g^{t,*}]⟩ )
= a^t ( P^{L^t}_D[g^{t,*}] − P^{L^t}_D[ĝ^t] )
≤ a^t ( P^{L^t,*}_D − P^{L^t}_D[ĝ^t] )
≤ 2L ( P^{L^t,*}_D − P^{L^t}_D[ĝ^t] )
≤ 2L E_X[‖η̂(X) − η(X)‖_1],

where the third step uses the definition of P^{L^t,*}_D and the last step follows from Lemma 14.

Next, for term 1, we have by an application of Hölder's inequality

⟨∇ψ(C_D[h^{t−1}]), C_D[g^{t,*}]⟩ − ⟨∇ψ(Ĉ_{S_2}[h^{t−1}]), C_D[g^{t,*}]⟩
≤ ‖∇ψ(Ĉ_{S_2}[h^{t−1}]) − ∇ψ(C_D[h^{t−1}])‖_∞ ‖C_D[g^{t,*}]‖_1
= ‖∇ψ(Ĉ_{S_2}[h^{t−1}]) − ∇ψ(C_D[h^{t−1}])‖_∞
≤ β ‖Ĉ_{S_2}[h^{t−1}] − C_D[h^{t−1}]‖_1
≤ β n² ‖Ĉ_{S_2}[h^{t−1}] − C_D[h^{t−1}]‖_∞
≤ β n² max_{i∈[t−1]} ‖Ĉ_{S_2}[ĝ^i] − C_D[ĝ^i]‖_∞
≤ β n² sup_{h∈H_η̂} ‖Ĉ_{S_2}[h] − C_D[h]‖_∞,

where the third step follows from the β-smoothness of ψ. One can similarly bound term 3. We thus have for all 1 ≤ t ≤ T,

max_{g:X→Δ_n} ⟨∇ψ(C_D[h^{t−1}]), C_D[g]⟩ − ⟨∇ψ(C_D[h^{t−1}]), C_D[ĝ^t]⟩ ≤ 2L E_X[‖η̂(X) − η(X)‖_1] + 2β n² sup_{h∈H_η̂} ‖Ĉ_{S_2}[h] − C_D[h]‖_∞.

Applying Lemma 15 to the |S_2| = m/2 examples in S_2, we have with probability at least 1 − δ (over the random draw of S_2 from D^{m/2}), for all 1 ≤ t ≤ T,

max_{g:X→Δ_n} ⟨∇ψ(C_D[h^{t−1}]), C_D[g]⟩ − ⟨∇ψ(C_D[h^{t−1}]), C_D[ĝ^t]⟩ ≤ 2L E_X[‖η̂(X) − η(X)‖_1] + 2√2 C β n² √( (n² log(n) log(m) + log(n²/δ)) / m ).

We are now ready to prove Theorem 16.

Proof of Theorem 16. Our proof shall make use of Lemma 26 and the standard convergence result for the Frank-Wolfe algorithm for maximizing a concave function over a convex set (Jaggi, 2013). We will find it useful to first define the following quantity, referred to as the curvature constant in (Jaggi, 2013):

C_ψ = sup_{C_1, C_2 ∈ C_D, γ ∈ [0,1]} (2/γ²) ( ψ(C_1) + γ⟨C_2 − C_1, ∇ψ(C_1)⟩ − ψ(C_1 + γ(C_2 − C_1)) ).

Also, define two positive scalars ε_S and δ_apx required in the analysis of (Jaggi, 2013):

ε_S = 2L E_X[‖η̂(X) − η(X)‖_1] + 2√2 C β n² √( (n² log(n) log(m) + log(n²/δ)) / m ),

δ_apx = (T + 1) ε_S / C_ψ,

where δ ∈ (0,1] is as in the theorem statement. Further, let the classifiers ĝ^1, ..., ĝ^T and h^0, h^1, ..., h^T be as defined in Algorithm 1. We then have from Lemma 26 that the following holds with probability at least 1 − δ, for all 1 ≤ t ≤ T:

⟨∇ψ(C_D[h^{t−1}]), C_D[ĝ^t]⟩ ≥ max_{g:X→Δ_n} ⟨∇ψ(C_D[h^{t−1}]), C_D[g]⟩ − ε_S
= max_{C∈C_D} ⟨∇ψ(C_D[h^{t−1}]), C⟩ − ε_S
= max_{C∈C_D} ⟨∇ψ(C_D[h^{t−1}]), C⟩ − (1/2) δ_apx (2/(T+1)) C_ψ
≥ max_{C∈C_D} ⟨∇ψ(C_D[h^{t−1}]), C⟩ − (1/2) δ_apx (2/(t+1)) C_ψ.    (9)

Also observe that for the two sequences of iterates given by the confusion matrices of the above classifiers,

C_D[h^t] = (1 − 2/(t+1)) C_D[h^{t−1}] + (2/(t+1)) C_D[ĝ^t],    (10)

for all 1 ≤ t ≤ T. Based on Eq. (9) and Eq. (10), one can now apply the result of (Jaggi, 2013). In particular, the sequence of iterates C_D[h^0], C_D[h^1], ..., C_D[h^T] can be considered as the sequence of iterates arising from running the Frank-Wolfe optimization method to maximize ψ over C_D with a linear optimization oracle that is (1/2) δ_apx (2/(t+1)) C_ψ accurate at iteration t. Since ψ is a concave function over the convex constraint set C_D, one has from Theorem 1 in (Jaggi, 2013) that the following convergence guarantee holds with probability at least 1 − δ:

P^ψ_D[h^FW_S] = ψ(C_D[h^FW_S]) = ψ(C_D[h^T]) ≥ max_{C∈C_D} ψ(C) − (2C_ψ/(T+2))(1 + δ_apx) = max_{C∈C_D} ψ(C) − 2C_ψ/(T+2) − 2ε_S (T+1)/(T+2) ≥ max_{C∈C_D} ψ(C) − 2C_ψ/(T+2) − 2ε_S.    (11)

We can further upper bound C_ψ in the above inequality in terms of the smoothness parameter of ψ:

C_ψ = sup_{C_1, C_2 ∈ C_D, γ ∈ [0,1]} (2/γ²) ( ψ(C_1) + γ⟨C_2 − C_1, ∇ψ(C_1)⟩ − ψ(C_1 + γ(C_2 − C_1)) ) ≤ sup_{C_1, C_2 ∈ C_D, γ ∈ [0,1]} (2/γ²) (β/2) γ² ‖C_1 − C_2‖_1² ≤ 4β,

where the second step follows from the β-smoothness of ψ. Substituting back into Eq. (11), we finally have with probability at least 1 − δ,

P^ψ_D[h^FW_S] ≥ max_{C∈C_D} ψ(C) − 8β/(T+2) − 2ε_S = P^{ψ,*}_D − 4L E_X[‖η̂(X) − η(X)‖_1] − 4√2 C β n² √( (n² log(n) log(m) + log(n²/δ)) / m ) − 8β/(T+2),

which follows from the definition of ε_S. Setting T = κ completes the proof.
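To make the structure of the analysis concrete, here is a schematic Python sketch of the Frank-Wolfe based procedure analysed above; it is our own illustrative re-implementation under stated assumptions, not the authors' MATLAB code, and Algorithm 1's exact specification is in the main paper. It assumes a trained CPE model eta_hat, the gradient grad_psi of ψ as a function of the confusion matrix, and a held-out sample (X2, y2) playing the role of S_2; all helper names are ours.

```python
import numpy as np

# Schematic sketch of the Frank-Wolfe based procedure (illustrative, not the paper's code).

def empirical_confusion(h, X, y, n):
    """Empirical confusion matrix Chat[i, j] = fraction with Y = i and h(X) = j."""
    C = np.zeros((n, n))
    for xi, yi in zip(X, y):
        C[yi, h(xi)] += 1.0 / len(y)
    return C

def frank_wolfe_classifier(grad_psi, eta_hat, X2, y2, n, T=50):
    weights, classifiers = [], []
    h0 = lambda x: 0                                      # h^0: some fixed initial classifier
    C = empirical_confusion(h0, X2, y2, n)
    for t in range(1, T + 1):
        G = grad_psi(C)                                   # gradient at current empirical iterate
        L = (G.max() - G) / max(G.max() - G.min(), 1e-12) # scale/shift -G into [0, 1]
        g = lambda x, L=L: int(np.argmin(eta_hat(x) @ L)) # plug-in classifier for loss L^t
        gamma = 2.0 / (t + 1)
        C = (1 - gamma) * C + gamma * empirical_confusion(g, X2, y2, n)
        weights = [(1 - gamma) * w for w in weights] + [gamma]
        classifiers.append(g)
    return classifiers, weights                           # randomized classifier h^T
```

The returned pair (classifiers, weights) represents the randomized classifier h^T as a convex combination of the plug-in classifiers ĝ^1, ..., ĝ^T, matching the update in Eq. (10).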

C.4. Proof of Theorem 17

Theorem 17. Let ψ : [0,1]^{n×n} → R_+ be such that ψ(C) = ⟨A, C⟩/⟨B, C⟩, where A, B ∈ R^{n×n}_+, sup_{C∈C_D} ψ(C) ≤ 1, and min_{C∈C_D} ⟨B, C⟩ ≥ b for some b > 0. Let S = (S_1, S_2) ∈ (X × [n])^m be a training sample drawn i.i.d. from D^m. Let η̂ : X → Δ_n be the CPE model learned from S_1 in Algorithm 2 and h^BS_S : X → [n] be the classifier obtained after κ iterations. Then for any δ ∈ (0,1], with probability at least 1 − δ (over the draw of S from D^m),

P^{ψ,*}_D − P^ψ_D[h^BS_S] ≤ 2τ E_X[‖η̂(X) − η(X)‖_1] + 2√2 C τ √( (n² log(n) log(m) + log(n²/δ)) / m ) + 2^{−κ},

where τ = (1/b)(‖A‖_∞ + ‖B‖_∞) and C > 0 is a distribution-independent constant.

We will find it useful to state the following lemmas.

Lemma 27 (Invariant in Algorithm 2). Let ψ be as defined in Theorem 17. Let H_η̂ = {h : X → [n] : h(x) ∈ argmin_{j∈[n]} Σ_{i=1}^n η̂_i(x) L_{ij} for some L ∈ [0,1]^{n×n}}. Then the following invariant is true at the end of each iteration 0 ≤ t ≤ T of Algorithm 2:

α^t − τ ε̄ < ψ(C_D[h^t]) ≤ sup_{C∈C_D} ψ(C) ≤ β^t + τ ε̄,

where τ = (1/b)(‖A‖_∞ + ‖B‖_∞) and ε̄ = E_X[‖η̂(X) − η(X)‖_1] + sup_{h∈H_η̂} ‖C_D[h] − Ĉ_{S_2}[h]‖_∞.

Proof. We first have from Lemma 14 the following guarantee for the linear minimization step at each iteration t of Algorithm 2:

⟨L^t, C_D[ĝ^t]⟩ ≤ min_{C∈C_D} ⟨L^t, C⟩ + E_X[‖η̂(X) − η(X)‖_1] = min_{C∈C_D} ⟨L^t, C⟩ + ε_1 (say).    (12)

Further, let us denote ε_2 = sup_{h∈H_η̂} ‖C_D[h] − Ĉ_{S_2}[h]‖_∞. Notice that ε̄ = ε_1 + ε_2.

We shall now prove the lemma by mathematical induction on the iteration number t. For t = 0, the invariant holds trivially, as α^0 = 0 ≤ ψ(C_D[h^0]) and sup_{C∈C_D} ψ(C) ≤ 1 = β^0. Assume the invariant holds at the end of iteration t − 1 ∈ {0, ..., T − 1}; we shall prove that the invariant holds at the end of iteration t. In particular, we consider two cases at iteration t.

In the first case, ψ(Γ̂^t) ≥ γ^t, leading to the assignments α^t = γ^t, β^t = β^{t−1}, and h^t = ĝ^t. We have from the definition of ε_2:

⟨A − γ^t B, C_D[ĝ^t]⟩ ≥ ⟨A − γ^t B, Γ̂^t⟩ − ‖A − γ^t B‖_∞ ε_2
= ⟨A, Γ̂^t⟩ − γ^t ⟨B, Γ̂^t⟩ − ‖A − γ^t B‖_∞ ε_2
= ⟨B, Γ̂^t⟩ (ψ(Γ̂^t) − γ^t) − ‖A − γ^t B‖_∞ ε_2
≥ 0 − ‖A − γ^t B‖_∞ ε_2
> −‖A − γ^t B‖_∞ (ε_1 + ε_2)
≥ −(‖A‖_∞ + ‖B‖_∞)(ε_1 + ε_2),

where the third step follows from our case assumption that ψ(Γ̂^t) ≥ γ^t and ⟨B, Γ̂^t⟩ > 0, the fifth step follows from ε_1 > 0, and the last step follows from the triangle inequality and γ^t ≤ sup_{C∈C_D} ψ(C) ≤ 1. The above inequality further gives us

⟨A, C_D[ĝ^t]⟩ / ⟨B, C_D[ĝ^t]⟩ ≥ γ^t − (‖A‖_∞ + ‖B‖_∞)(ε_1 + ε_2) / ⟨B, C_D[ĝ^t]⟩ ≥ γ^t − (1/b)(‖A‖_∞ + ‖B‖_∞)(ε_1 + ε_2) = γ^t − τ ε̄ = α^t − τ ε̄,

where the second step follows from min_{C∈C_D} ⟨B, C⟩ ≥ b and the last step follows from the assignment α^t = γ^t. In other words,

ψ(C_D[h^t]) = ψ(C_D[ĝ^t]) = ⟨A, C_D[ĝ^t]⟩ / ⟨B, C_D[ĝ^t]⟩ > α^t − τ ε̄.

Moreover, by our assumption that the invariant holds at the end of iteration t − 1, we have

β^t + τ ε̄ = β^{t−1} + τ ε̄ ≥ sup_{C∈C_D} ψ(C) ≥ ψ(C_D[h^t]) > α^t − τ ε̄.

Thus under the first case, the invariant holds at the end of iteration t.

In the second case, ψ(Γ̂^t) < γ^t at iteration t, which leads to the assignments α^t = α^{t−1}, β^t = γ^t, and h^t = h^{t−1}. Since the invariant is assumed to hold at the end of iteration t − 1, we have

α^t − τ ε̄ = α^{t−1} − τ ε̄ < ψ(C_D[h^{t−1}]) = ψ(C_D[h^t]).    (13)

Next, recall that L^t ∈ [0,1]^{n×n} is a scaled and translated version of (γ^t B − A); since A, B have non-negative entries and 0 ≤ γ^t ≤ 1, there clearly exist c^t ∈ R and 0 < a^t ≤ ‖A‖_∞ + ‖B‖_∞ such that A − γ^t B = c^t − a^t L^t (with c^t understood entrywise). Then for C̄ ∈ argmax_{C∈C̄_D} ⟨A − γ^t B, C⟩, we have

⟨A − γ^t B, C̄⟩ = c^t − a^t ⟨L^t, C̄⟩
≤ c^t − a^t ⟨L^t, C_D[ĝ^t]⟩ + a^t ε_1
≤ c^t − a^t ⟨L^t, C_D[ĝ^t]⟩ + (‖A‖_∞ + ‖B‖_∞) ε_1
= ⟨A − γ^t B, C_D[ĝ^t]⟩ + (‖A‖_∞ + ‖B‖_∞) ε_1
≤ ⟨A − γ^t B, Γ̂^t⟩ + ‖A − γ^t B‖_∞ ε_2 + (‖A‖_∞ + ‖B‖_∞) ε_1
= ⟨A, Γ̂^t⟩ − γ^t ⟨B, Γ̂^t⟩ + ‖A − γ^t B‖_∞ ε_2 + (‖A‖_∞ + ‖B‖_∞) ε_1
= ⟨B, Γ̂^t⟩ (ψ(Γ̂^t) − γ^t) + ‖A − γ^t B‖_∞ ε_2 + (‖A‖_∞ + ‖B‖_∞) ε_1
≤ 0 + ‖A − γ^t B‖_∞ ε_2 + (‖A‖_∞ + ‖B‖_∞) ε_1
≤ (‖A‖_∞ + ‖B‖_∞)(ε_1 + ε_2),

where the second step follows from Eq. (12), the third step uses a^t ≤ ‖A‖_∞ + ‖B‖_∞, the fifth step follows from the definition of ε_2 and Hölder's inequality, the eighth step follows from our case assumption that ψ(Γ̂^t) < γ^t and ⟨B, Γ̂^t⟩ > 0, and the last step follows from the triangle inequality and γ^t ≤ sup_{C∈C_D} ψ(C) ≤ 1. In particular, we have for all C ∈ C_D,

⟨A − γ^t B, C⟩ ≤ (‖A‖_∞ + ‖B‖_∞)(ε_1 + ε_2),

or

⟨A, C⟩ / ⟨B, C⟩ ≤ γ^t + (‖A‖_∞ + ‖B‖_∞)(ε_1 + ε_2) / ⟨B, C⟩.

Since min_{C∈C_D} ⟨B, C⟩ ≥ b, we have from the above, for all C ∈ C_D,

ψ(C) ≤ γ^t + (1/b)(‖A‖_∞ + ‖B‖_∞)(ε_1 + ε_2) = γ^t + τ ε̄ = β^t + τ ε̄.

In other words, sup_{C∈C_D} ψ(C) ≤ β^t + τ ε̄. By combining the above with Eq. (13), we can see that the invariant holds at iteration t under this case as well. This completes the proof of the lemma.

Lemma 28 (Multiplicative progress in each iteration of Algorithm 2). Let ψ be as defined in Theorem 17. Then the following is true in each iteration 1 ≤ t ≤ T of Algorithm 2:

β^t − α^t ≤ (1/2)(β^{t−1} − α^{t−1}).

Proof. We consider two cases in each iteration of Algorithm 2. If in an iteration t ∈ {1, ..., T}, ψ(Γ̂^t) ≥ γ^t, leading to the assignment α^t = γ^t, then

β^t − α^t = β^{t−1} − γ^t = β^{t−1} − (α^{t−1} + β^{t−1})/2 = (1/2)(β^{t−1} − α^{t−1}).

On the other hand, if ψ(Γ̂^t) < γ^t, leading to the assignment β^t = γ^t, then

β^t − α^t = γ^t − α^{t−1} = (α^{t−1} + β^{t−1})/2 − α^{t−1} = (1/2)(β^{t−1} − α^{t−1}).

Thus in both cases, the statement of the lemma is seen to hold.

We now prove Theorem 17.

Proof of Theorem 17. For the classifier h^BS_S = h^T output by Algorithm 2 after T iterations, we have from Lemma 27,

P^{ψ,*}_D − P^ψ_D[h^BS_S] = sup_{C∈C_D} ψ(C) − ψ(C_D[h^T]) < (β^T + τ ε̄) − (α^T − τ ε̄) = β^T − α^T + 2τ ε̄ ≤ 2^{−T}(β^0 − α^0) + 2τ ε̄ = 2^{−T}(1 − 0) + 2τ ε̄ = 2^{−T} + 2τ ε̄,

where ε̄ is as defined in Lemma 27, and the second inequality follows from a repeated application of Lemma 28. Setting T = κ thus gives us

P^{ψ,*}_D − P^ψ_D[h^BS_S] ≤ 2τ E_X[‖η̂(X) − η(X)‖_1] + 2τ sup_{h∈H_η̂} ‖C_D[h] − Ĉ_{S_2}[h]‖_∞ + 2^{−κ}.

By an application of Lemma 15 to the second term on the right-hand side of the above inequality (noting that |S_2| = m/2), we then have for any δ > 0, with probability at least 1 − δ,

P^{ψ,*}_D − P^ψ_D[h^BS_S] ≤ 2τ E_X[‖η̂(X) − η(X)‖_1] + 2√2 C τ √( (n² log(n) log(m) + log(n²/δ)) / m ) + 2^{−κ},

for a distribution-independent constant C > 0.
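Analogously, the following schematic sketch (again our own illustration under stated assumptions, not the authors' code) shows the bisection based procedure analysed above for a ratio-of-linear ψ(C) = ⟨A, C⟩/⟨B, C⟩, maintaining the interval [α, β] whose behaviour is captured by Lemmas 27 and 28. It assumes a trained CPE model eta_hat and a held-out sample (X2, y2) standing in for S_2; helper names are ours.

```python
import numpy as np

# Schematic sketch of the bisection based procedure (illustrative, not the paper's code).

def empirical_confusion(h, X, y, n):
    C = np.zeros((n, n))
    for xi, yi in zip(X, y):
        C[yi, h(xi)] += 1.0 / len(y)
    return C

def bisection_classifier(A, B, eta_hat, X2, y2, n, T=20):
    alpha, beta = 0.0, 1.0                      # optimal value tracked within ~[alpha, beta]
    h_best = lambda x: 0                        # h^0: some fixed initial classifier
    for _ in range(T):
        gamma = (alpha + beta) / 2.0
        M = gamma * B - A                       # loss direction; scale/shift into [0, 1]
        L = (M - M.min()) / max(M.max() - M.min(), 1e-12)
        g = lambda x, L=L: int(np.argmin(eta_hat(x) @ L))   # cost-sensitive plug-in
        Gamma = empirical_confusion(g, X2, y2, n)
        if (A * Gamma).sum() / max((B * Gamma).sum(), 1e-12) >= gamma:
            alpha, h_best = gamma, g            # psi(Gamma^t) >= gamma^t: raise alpha
        else:
            beta = gamma                        # otherwise lower beta
    return h_best
```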

C.5. Extending Algorithm 1 to Non-Smooth Performance Measures

In Section 5, we showed that Algorithm 1 is consistent for any concave smooth performance measure (see Theorem 16). We now extend this result to concave performance measures for which the associated ψ is non-smooth (but differentiable); these include the G-mean, H-mean and Q-mean performance measures. In particular, for these performance measures, we prescribe that Algorithm 1 be applied to a suitable smooth approximation of ψ; if the quality of this approximation improves with the size of the given training sample (at an appropriate rate), then the resulting algorithm can be shown to be consistent for the original performance measure.

Theorem 29. Let ψ : [0,1]^{n×n} → R_+ be such that for any ρ ∈ R_+ there exists ψ_ρ : [0,1]^{n×n} → R_+ which is concave over C_D, and L_ρ-Lipschitz and β_ρ-smooth w.r.t. the ℓ_1 norm, with

sup_{C∈C_D} |ψ(C) − ψ_ρ(C)| ≤ θ(ρ),

for some strictly increasing function θ : R_+ → R_+. Let S = (S_1, S_2) ∈ (X × [n])^m be a training sample drawn i.i.d. from D^m. Further, let η̂ : X → Δ_n be the CPE model learned from S_1 in Algorithm 1 and h^{FW,ρ}_S : X → Δ_n be the classifier obtained after κ iterations by Algorithm 1 when run for the performance measure ψ_ρ. Then for any δ ∈ (0,1], with probability at least 1 − δ (over the draw of S from D^m),

P^{ψ,*}_D − P^ψ_D[h^{FW,ρ}_S] ≤ 4L_ρ E_X[‖η̂(X) − η(X)‖_1] + 4√2 β_ρ n² C √( (n² log(n) log(m) + log(n²/δ)) / m ) + 8β_ρ/(κ + 2) + 2θ(ρ),

where C > 0 is a distribution-independent constant.

Proof. From Theorem 16 we have that

P^{ψ_ρ,*}_D − P^{ψ_ρ}_D[h^{FW,ρ}_S] ≤ 4L_ρ E_X[‖η̂(X) − η(X)‖_1] + 4√2 β_ρ n² C √( (n² log(n) log(m) + log(n²/δ)) / m ) + 8β_ρ/(κ + 2).    (14)

For simplicity assume that the ψ-optimal classifier exists; the proof can be easily extended when this is not the case. Let h* : X → Δ_n be a ψ-optimal classifier; note that this classifier need not be ψ_ρ-optimal. We then have

P^{ψ,*}_D − P^ψ_D[h^{FW,ρ}_S] = ψ(C_D[h*]) − ψ(C_D[h^{FW,ρ}_S])
≤ ψ_ρ(C_D[h*]) − ψ_ρ(C_D[h^{FW,ρ}_S]) + 2θ(ρ)
= P^{ψ_ρ}_D[h*] − P^{ψ_ρ}_D[h^{FW,ρ}_S] + 2θ(ρ)
≤ P^{ψ_ρ,*}_D − P^{ψ_ρ}_D[h^{FW,ρ}_S] + 2θ(ρ)
≤ 4L_ρ E_X[‖η̂(X) − η(X)‖_1] + 4√2 β_ρ n² C √( (n² log(n) log(m) + log(n²/δ)) / m ) + 8β_ρ/(κ + 2) + 2θ(ρ),

where the second step follows from our assumption that sup_{C∈C_D} |ψ(C) − ψ_ρ(C)| ≤ θ(ρ), the fourth step follows from the definition of P^{ψ_ρ,*}_D, and the last step uses Eq. (14). This completes the proof.

We note that for each of the G-mean, H-mean and Q-mean, one can construct a Lipschitz, smooth approximation ψ_ρ as required in the above theorem (one possible construction for the G-mean is sketched below). Now, suppose the CPE algorithm used in Algorithm 1 is such that the class probability estimation error term in the theorem, E_X[‖η̂(X) − η(X)‖_1], converges in probability to 0 as the number of training examples m → ∞. Then for each of the given performance measures, one can allow the parameter ρ (that determines the approximation quality of ψ_ρ) to go to 0 as m → ∞ (at an appropriate rate), so that the right-hand side of the bound in the theorem goes to 0 (as m → ∞), implying that Algorithm 1 is ψ-consistent. We postpone the details to a longer version of the paper.
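As an illustration of the kind of surrogate Theorem 29 requires, one possible smoothing of the G-mean is given below; this particular construction is ours and is only meant as an example of a ψ_ρ satisfying the stated conditions. Over C_D the row sums Σ_j C_{ij} = π_i are fixed by D, so each C_ii/π_i is linear in C and the surrogate is concave over C_D.

```latex
% One possible smoothing of the G-mean (an illustrative choice, not taken from the paper).
\[
  \psi_{\mathrm{GM}}(C) = \Big(\prod_{i=1}^n \frac{C_{ii}}{\pi_i}\Big)^{1/n},
  \qquad
  \psi_\rho(C) = \Big(\prod_{i=1}^n \Big(\frac{C_{ii}}{\pi_i} + \rho\Big)\Big)^{1/n} - \rho .
\]
% Over C_D, \psi_\rho is concave, and Lipschitz and smooth with constants L_\rho, \beta_\rho
% that grow as \rho \to 0, while |\psi_\rho(C) - \psi_{\mathrm{GM}}(C)| \le \rho for all
% C \in C_D, so one may take \theta(\rho) = \rho.
```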

D. Supplementary Material for Section 6 (Experiments)

D.1. Computation of Class Probability Function for the Distribution Considered in the Synthetic Data Experiments

We provide the calculations for the class probability function for the distribution considered in the synthetic data experiments in Section 6. We present this for a more general distribution over R^d × [n], where for each class i, the class prior probability is π_i and the class conditional distribution is a Gaussian distribution with mean µ_i ∈ R^d and the same (symmetric positive semi-definite) covariance matrix Σ ∈ R^{d×d}. We shall denote the pdf for the Gaussian corresponding to class i as

f_i(x) = (1/√((2π)^d |Σ|)) exp( −(1/2)(x − µ_i)^⊤ Σ^{−1} (x − µ_i) ).

The class probability function for this distribution is then given by

η_i(x) = P(Y = i | X = x) = π_i f_i(x) / Σ_{j=1}^n π_j f_j(x)
= exp( −(1/2)(x − µ_i)^⊤ Σ^{−1} (x − µ_i) + ln π_i ) / Σ_{j=1}^n exp( −(1/2)(x − µ_j)^⊤ Σ^{−1} (x − µ_j) + ln π_j )
= exp( µ_i^⊤ Σ^{−1} x − (1/2) µ_i^⊤ Σ^{−1} µ_i + ln π_i ) exp( −(1/2) x^⊤ Σ^{−1} x ) / Σ_{j=1}^n exp( µ_j^⊤ Σ^{−1} x − (1/2) µ_j^⊤ Σ^{−1} µ_j + ln π_j ) exp( −(1/2) x^⊤ Σ^{−1} x )
= exp( w_i^⊤ x + b_i ) / Σ_{j=1}^n exp( w_j^⊤ x + b_j ),

where w_i = Σ^{−1} µ_i and b_i = −(1/2) µ_i^⊤ Σ^{−1} µ_i + ln π_i. Clearly, the class probability function for the distribution considered can be obtained as a softmax of a linear function.

D.2. Additional Experimental Details/Results

In all our experiments, the regularization parameter for each algorithm was chosen from the range {10^{−4}, ..., 10^4} using a held-out portion of the training set.

Synthetic data experiments. Since the distribution used to generate the synthetic data and the four performance measures considered satisfy the conditions in Theorem 13, the optimal classifier for each performance measure can be obtained by computing the ψ^L-optimal classifier for some loss matrix L ∈ [0,1]^{n×n}; we have a similar characterization for the micro F_1 using Theorem 11. In our experiments, we computed the optimal performance for a given performance measure by performing a search over a large range of n×n loss matrices L, used the true conditional class probability to compute a ψ^L-optimal classifier for each such L (see Proposition 6), and chose among these classifiers the one which gave the highest performance value (on a large sample drawn from the distribution). Moreover, since the class probability function here is a softmax of linear functions, it follows that the Bayes optimal performance is also achieved by a linear classification model, and therefore learning a linear model suffices to achieve consistency; we therefore learn a linear classification model in all experiments. Also, recall that Algorithm 1 outputs a randomized classifier, while Algorithm 2 outputs a deterministic classifier. In our experimental results, the ψ-performance of a randomized classifier was evaluated using the expected (empirical) confusion matrix of the deterministic classifiers in its support.

Real data experiments. All real data sets used in our experiments are listed in Table 6.

Table 6. Data sets used in the experiments in Section 6, with the number of instances, features and classes for each: UCI — car, pageblocks, glass, satimage, covtype, yeast, abalone; IR — cora, news, rcv1.

The version of the CoRA data set used in our experiments was obtained from … . The 20 Newsgroup data was obtained from http://qwone.com/~jason/20Newsgroups/. For the RCV1 data, we used a preprocessed version obtained from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. For each of the UCI data sets used in our experiments, the training set was normalized to 0 mean and unit variance, and this transformation was applied to the test set. Tables 7–9 contain results on UCI data sets not provided in Section 6. Tables 10–12 contain training times for Algorithm 1 (applied to the G-mean, H-mean and Q-mean measures) and the baseline SVMperf and 0-1 logistic regression methods on all UCI data sets; in each case, the symbol marked against SVMperf indicates that the method did not complete after 96 hrs.

Table 7. Results for G-mean on UCI data sets (car, pageblocks, glass, satimage, covtype, yeast, abalone): Frank-Wolfe (GM), SVMperf (GM), LogReg (0-1).

Table 8. Results for H-mean on UCI data sets (car, pageblocks, glass, satimage, covtype, yeast, abalone): Frank-Wolfe (HM), SVMperf (HM), LogReg (0-1).

Table 9. Results for Q-mean on UCI data sets (car, pageblocks, glass, satimage, covtype, yeast, abalone): Frank-Wolfe (QM), SVMperf (QM), LogReg (0-1).

Table 10. Training time (in secs) for G-mean on UCI data sets (car, pageblocks, glass, satimage, covtype, yeast, abalone): Frank-Wolfe (GM), SVMperf (GM), LogReg (0-1).

Table 11. Training time (in secs) for H-mean on UCI data sets (car, pageblocks, glass, satimage, covtype, yeast, abalone): Frank-Wolfe (HM), SVMperf (HM), LogReg (0-1).

Table 12. Training time (in secs) for Q-mean on UCI data sets (car, pageblocks, glass, satimage, covtype, yeast, abalone): Frank-Wolfe (QM), SVMperf (QM), LogReg (0-1).

Implementation details. The proposed Frank-Wolfe based and bisection based algorithms were implemented in MATLAB; in order to learn a CPE model in these algorithms, we used the multiclass logistic regression solver provided at https://www.cs.ubc.ca/~schmidtm/Software/minFunc.html for the experiments on the synthetic and UCI data sets, and the liblinear logistic regression implementation provided at https://www.csie.ntu.edu.tw/~cjlin/liblinear for the experiments on the IR data. All run-time experiments were run on Intel Xeon quad-core machines (2.66 GHz, 12 MB cache) with 16 GB RAM. We implemented SVMperf using a publicly available structural SVM API. The SVMperf method, proposed originally for binary performance measures (Joachims, 2005), uses a cutting plane solver in which computing the most-violated constraint requires a search over all valid confusion matrices for the given training sample. In the case of the G-mean, H-mean and Q-mean measures, this search can be restricted to the diagonal entries of the confusion matrix, but will still require (in the worst case) time exponential in the number of classes; in the case of the micro F_1, this search is more expensive and involves searching over 3n − 3 entries of the confusion matrix. While we use an exact implementation of SVMperf for these three performance measures, for the micro F_1-measure we use a version that optimizes an approximation to the micro F_1 (in particular, the variant of micro F_1 analyzed by Parambath et al. (2014)) and requires fewer computations. The tolerance parameter for the cutting-plane method in SVMperf was set to 0.01 for all experiments except on the Pageblocks and CoRA data sets, where the tolerance was set to 0.1 to enable faster run-time.
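Finally, tying back to Section D.1: for the synthetic-data setup with Gaussian class conditionals sharing a covariance matrix, η(x) is a softmax of a linear function of x. The sketch below (ours; the priors, means and covariance are placeholder values) generates a point from such a mixture and evaluates η at it.

```python
import numpy as np

# Illustrative sketch for Section D.1: eta(x) as a softmax of a linear function.
# The priors, means and covariance below are placeholders, not the paper's settings.

rng = np.random.default_rng(0)
n, d = 3, 2
pi = np.array([0.5, 0.3, 0.2])                        # class priors (assumed)
mu = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]])   # class means (assumed)
Sigma = np.eye(d)                                     # shared covariance (assumed)

Sigma_inv = np.linalg.inv(Sigma)
W = mu @ Sigma_inv                                    # rows w_i = Sigma^{-1} mu_i
b = -0.5 * np.sum(mu @ Sigma_inv * mu, axis=1) + np.log(pi)   # b_i as in D.1

def eta(x):
    """Class probability vector eta(x) = softmax(Wx + b)."""
    s = W @ x + b
    s -= s.max()                                      # numerical stability
    e = np.exp(s)
    return e / e.sum()

# Draw a point from the mixture and evaluate eta at it.
y = rng.choice(n, p=pi)
x = rng.multivariate_normal(mu[y], Sigma)
print(y, eta(x))
```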


More information

Ch 12: Variations on Backpropagation

Ch 12: Variations on Backpropagation Ch 2: Variations on Backpropagation The basic backpropagation algorith is too slow for ost practical applications. It ay take days or weeks of coputer tie. We deonstrate why the backpropagation algorith

More information

Solutions 1. Introduction to Coding Theory - Spring 2010 Solutions 1. Exercise 1.1. See Examples 1.2 and 1.11 in the course notes.

Solutions 1. Introduction to Coding Theory - Spring 2010 Solutions 1. Exercise 1.1. See Examples 1.2 and 1.11 in the course notes. Solutions 1 Exercise 1.1. See Exaples 1.2 and 1.11 in the course notes. Exercise 1.2. Observe that the Haing distance of two vectors is the iniu nuber of bit flips required to transfor one into the other.

More information

Detection and Estimation Theory

Detection and Estimation Theory ESE 54 Detection and Estiation Theory Joseph A. O Sullivan Sauel C. Sachs Professor Electronic Systes and Signals Research Laboratory Electrical and Systes Engineering Washington University 11 Urbauer

More information

Randomized Recovery for Boolean Compressed Sensing

Randomized Recovery for Boolean Compressed Sensing Randoized Recovery for Boolean Copressed Sensing Mitra Fatei and Martin Vetterli Laboratory of Audiovisual Counication École Polytechnique Fédéral de Lausanne (EPFL) Eail: {itra.fatei, artin.vetterli}@epfl.ch

More information

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition

Pattern Recognition and Machine Learning. Learning and Evaluation for Pattern Recognition Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2017 Lesson 1 4 October 2017 Outline Learning and Evaluation for Pattern Recognition Notation...2 1. The Pattern Recognition

More information

Shannon Sampling II. Connections to Learning Theory

Shannon Sampling II. Connections to Learning Theory Shannon Sapling II Connections to Learning heory Steve Sale oyota echnological Institute at Chicago 147 East 60th Street, Chicago, IL 60637, USA E-ail: sale@athberkeleyedu Ding-Xuan Zhou Departent of Matheatics,

More information

Learnability of Gaussians with flexible variances

Learnability of Gaussians with flexible variances Learnability of Gaussians with flexible variances Ding-Xuan Zhou City University of Hong Kong E-ail: azhou@cityu.edu.hk Supported in part by Research Grants Council of Hong Kong Start October 20, 2007

More information

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points

More information

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning Boosting. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning Boosting Mehryar Mohri Courant Institute and Google Research ohri@cis.nyu.edu Weak Learning Definition: concept class C is weakly PAC-learnable if there exists a (weak)

More information

arxiv: v1 [cs.ds] 17 Mar 2016

arxiv: v1 [cs.ds] 17 Mar 2016 Tight Bounds for Single-Pass Streaing Coplexity of the Set Cover Proble Sepehr Assadi Sanjeev Khanna Yang Li Abstract arxiv:1603.05715v1 [cs.ds] 17 Mar 2016 We resolve the space coplexity of single-pass

More information

Optimal Jamming Over Additive Noise: Vector Source-Channel Case

Optimal Jamming Over Additive Noise: Vector Source-Channel Case Fifty-first Annual Allerton Conference Allerton House, UIUC, Illinois, USA October 2-3, 2013 Optial Jaing Over Additive Noise: Vector Source-Channel Case Erah Akyol and Kenneth Rose Abstract This paper

More information

Estimating Parameters for a Gaussian pdf

Estimating Parameters for a Gaussian pdf Pattern Recognition and achine Learning Jaes L. Crowley ENSIAG 3 IS First Seester 00/0 Lesson 5 7 Noveber 00 Contents Estiating Paraeters for a Gaussian pdf Notation... The Pattern Recognition Proble...3

More information

Lecture 21. Interior Point Methods Setup and Algorithm

Lecture 21. Interior Point Methods Setup and Algorithm Lecture 21 Interior Point Methods In 1984, Kararkar introduced a new weakly polynoial tie algorith for solving LPs [Kar84a], [Kar84b]. His algorith was theoretically faster than the ellipsoid ethod and

More information

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES

Proc. of the IEEE/OES Seventh Working Conference on Current Measurement Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Proc. of the IEEE/OES Seventh Working Conference on Current Measureent Technology UNCERTAINTIES IN SEASONDE CURRENT VELOCITIES Belinda Lipa Codar Ocean Sensors 15 La Sandra Way, Portola Valley, CA 98 blipa@pogo.co

More information

1 Generalization bounds based on Rademacher complexity

1 Generalization bounds based on Rademacher complexity COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #0 Scribe: Suqi Liu March 07, 08 Last tie we started proving this very general result about how quickly the epirical average converges

More information

Lecture 9 November 23, 2015

Lecture 9 November 23, 2015 CSC244: Discrepancy Theory in Coputer Science Fall 25 Aleksandar Nikolov Lecture 9 Noveber 23, 25 Scribe: Nick Spooner Properties of γ 2 Recall that γ 2 (A) is defined for A R n as follows: γ 2 (A) = in{r(u)

More information

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression

Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Efficient Learning of Generalized Linear and Single Index Models with Isotonic Regression Sha M Kakade Microsoft Research and Wharton, U Penn skakade@icrosoftco Varun Kanade SEAS, Harvard University vkanade@fasharvardedu

More information

Introduction to Machine Learning. Recitation 11

Introduction to Machine Learning. Recitation 11 Introduction to Machine Learning Lecturer: Regev Schweiger Recitation Fall Seester Scribe: Regev Schweiger. Kernel Ridge Regression We now take on the task of kernel-izing ridge regression. Let x,...,

More information

Exact tensor completion with sum-of-squares

Exact tensor completion with sum-of-squares Proceedings of Machine Learning Research vol 65:1 54, 2017 30th Annual Conference on Learning Theory Exact tensor copletion with su-of-squares Aaron Potechin Institute for Advanced Study, Princeton David

More information

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison

Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison yströ Method vs : A Theoretical and Epirical Coparison Tianbao Yang, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, Zhi-Hua Zhou Machine Learning Lab, GE Global Research, San Raon, CA 94583 Michigan State University,

More information

Tail estimates for norms of sums of log-concave random vectors

Tail estimates for norms of sums of log-concave random vectors Tail estiates for nors of sus of log-concave rando vectors Rados law Adaczak Rafa l Lata la Alexander E. Litvak Alain Pajor Nicole Toczak-Jaegerann Abstract We establish new tail estiates for order statistics

More information

Computable Shell Decomposition Bounds

Computable Shell Decomposition Bounds Coputable Shell Decoposition Bounds John Langford TTI-Chicago jcl@cs.cu.edu David McAllester TTI-Chicago dac@autoreason.co Editor: Leslie Pack Kaelbling and David Cohn Abstract Haussler, Kearns, Seung

More information

Polygonal Designs: Existence and Construction

Polygonal Designs: Existence and Construction Polygonal Designs: Existence and Construction John Hegean Departent of Matheatics, Stanford University, Stanford, CA 9405 Jeff Langford Departent of Matheatics, Drake University, Des Moines, IA 5011 G

More information

Probability Distributions

Probability Distributions Probability Distributions In Chapter, we ephasized the central role played by probability theory in the solution of pattern recognition probles. We turn now to an exploration of soe particular exaples

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

Learnability and Stability in the General Learning Setting

Learnability and Stability in the General Learning Setting Learnability and Stability in the General Learning Setting Shai Shalev-Shwartz TTI-Chicago shai@tti-c.org Ohad Shair The Hebrew University ohadsh@cs.huji.ac.il Nathan Srebro TTI-Chicago nati@uchicago.edu

More information

arxiv: v1 [cs.lg] 8 Jan 2019

arxiv: v1 [cs.lg] 8 Jan 2019 Data Masking with Privacy Guarantees Anh T. Pha Oregon State University phatheanhbka@gail.co Shalini Ghosh Sasung Research shalini.ghosh@gail.co Vinod Yegneswaran SRI international vinod@csl.sri.co arxiv:90.085v

More information

Machine Learning Basics: Estimators, Bias and Variance

Machine Learning Basics: Estimators, Bias and Variance Machine Learning Basics: Estiators, Bias and Variance Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics in Basics

More information

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon

Model Fitting. CURM Background Material, Fall 2014 Dr. Doreen De Leon Model Fitting CURM Background Material, Fall 014 Dr. Doreen De Leon 1 Introduction Given a set of data points, we often want to fit a selected odel or type to the data (e.g., we suspect an exponential

More information

The Weierstrass Approximation Theorem

The Weierstrass Approximation Theorem 36 The Weierstrass Approxiation Theore Recall that the fundaental idea underlying the construction of the real nubers is approxiation by the sipler rational nubers. Firstly, nubers are often deterined

More information

Convex Programming for Scheduling Unrelated Parallel Machines

Convex Programming for Scheduling Unrelated Parallel Machines Convex Prograing for Scheduling Unrelated Parallel Machines Yossi Azar Air Epstein Abstract We consider the classical proble of scheduling parallel unrelated achines. Each job is to be processed by exactly

More information

Asynchronous Gossip Algorithms for Stochastic Optimization

Asynchronous Gossip Algorithms for Stochastic Optimization Asynchronous Gossip Algoriths for Stochastic Optiization S. Sundhar Ra ECE Dept. University of Illinois Urbana, IL 680 ssrini@illinois.edu A. Nedić IESE Dept. University of Illinois Urbana, IL 680 angelia@illinois.edu

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs On the Inapproxiability of Vertex Cover on k-partite k-unifor Hypergraphs Venkatesan Guruswai and Rishi Saket Coputer Science Departent Carnegie Mellon University Pittsburgh, PA 1513. Abstract. Coputing

More information

Bipartite subgraphs and the smallest eigenvalue

Bipartite subgraphs and the smallest eigenvalue Bipartite subgraphs and the sallest eigenvalue Noga Alon Benny Sudaov Abstract Two results dealing with the relation between the sallest eigenvalue of a graph and its bipartite subgraphs are obtained.

More information

A1. Find all ordered pairs (a, b) of positive integers for which 1 a + 1 b = 3

A1. Find all ordered pairs (a, b) of positive integers for which 1 a + 1 b = 3 A. Find all ordered pairs a, b) of positive integers for which a + b = 3 08. Answer. The six ordered pairs are 009, 08), 08, 009), 009 337, 674) = 35043, 674), 009 346, 673) = 3584, 673), 674, 009 337)

More information

Pattern Recognition and Machine Learning. Artificial Neural networks

Pattern Recognition and Machine Learning. Artificial Neural networks Pattern Recognition and Machine Learning Jaes L. Crowley ENSIMAG 3 - MMIS Fall Seester 2016 Lessons 7 14 Dec 2016 Outline Artificial Neural networks Notation...2 1. Introduction...3... 3 The Artificial

More information

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t.

This model assumes that the probability of a gap has size i is proportional to 1/i. i.e., i log m e. j=1. E[gap size] = i P r(i) = N f t. CS 493: Algoriths for Massive Data Sets Feb 2, 2002 Local Models, Bloo Filter Scribe: Qin Lv Local Models In global odels, every inverted file entry is copressed with the sae odel. This work wells when

More information

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science

A Better Algorithm For an Ancient Scheduling Problem. David R. Karger Steven J. Phillips Eric Torng. Department of Computer Science A Better Algorith For an Ancient Scheduling Proble David R. Karger Steven J. Phillips Eric Torng Departent of Coputer Science Stanford University Stanford, CA 9435-4 Abstract One of the oldest and siplest

More information