Convexity, Detection, and Generalized f-divergences


Khashayar Khosravi, Feng Ruan, John Duchi

June 5

1 Introduction

The goal of the classification problem is to learn a discriminant function for classifying forthcoming data. Given data arriving in i.i.d. pairs $(X_i, Y_i)$, $1 \le i \le n$, from some underlying distributions with $X \in \mathcal{X}$ and $Y \in \mathcal{Y} = \{1, 2, 3, \ldots, k\}$, we aim to design a classifier $\gamma$ with the smallest 0-1 loss, given by $\mathbb{P}(\gamma(X) \neq Y)$. However, minimizing the 0-1 loss directly is often computationally intractable, and practical algorithms are based on a convex relaxation of the 0-1 loss, say $\Phi$, which is called the surrogate loss. It is natural to ask whether solving the minimization problem over $\Phi$ leads to a minimizer of the 0-1 loss; this property is called consistency. In contrast to binary classification, the analysis of consistency of $\Phi$ in the multi-class classification problem requires more sophisticated approaches.

The prior Bayes $\Phi$-risk is defined as the optimal achievable risk given access only to the probabilities of the labels. Likewise, the posterior Bayes $\Phi$-risk is the optimal achievable risk once we additionally have access to the posterior densities, given by $p_c(x) = p(x \mid Y = c)$ for $1 \le c \le k$. The difference between the two is an important quantity, as it indicates the amount of information that $X$ carries. In the binary classification problem, this gap was shown in [LV06] to equal an $f$-divergence between the probability distributions $P_1$ and $P_2$, for a convex $f : \mathbb{R} \to \mathbb{R}$.

In many applications, the covariates used for solving the minimization problem are either received after passing through a dimension-reducing quantizer, or are deliberately transformed by a feature-selection stage in order to obtain a more meaningful interpretation. The decentralized detection problem, as described in [Tsi93], is an example of the first situation: we collect information from sensors located in different places and aim to solve a hypothesis testing problem at a fusion center. Owing to constraints on sensor power and on the communication channels, we must send quantized versions of the collected information to the fusion center. In experimental design problems we often encounter the second situation, since we discard information that is useless for our analysis.

For modeling purposes, let $q : \mathcal{X} \to \mathcal{Z} = \{1, 2, \ldots, m\}$ be a quantizer. Our aim is to jointly learn a classifier $\gamma : \mathcal{Z} \to \mathcal{Y}$ and a quantizer $q$ by minimizing the misclassification rate, given by $\mathbb{P}(\gamma(q(X)) \neq Y)$. Considering any convex surrogate loss $\Phi$, we can find the optimal risk of a quantizer $q$, denoted $R_\Phi(q)$, by empirically minimizing $\mathbb{E}[\Phi_Y(\gamma(q(X)))]$ over $\gamma$. One fundamental question is to find all surrogate losses $\Phi$ that are universally equivalent to the 0-1 loss, in the sense that, for all underlying probability distributions on $\mathcal{X} \times \mathcal{Y}$, the $\Phi$ loss and the 0-1 loss induce the same ordering on the optimal risk of quantizers.
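To make the ordering on quantizers concrete, the following minimal sketch (our construction, not part of the report) enumerates all quantizers on a small finite $\mathcal{X}$ and orders them by the optimal 0-1 risk of the best classifier $\gamma$ built on top of each; this is exactly the ordering that universal equivalence concerns.

```python
# A minimal sketch of the joint (classifier, quantizer) problem on a finite
# covariate space; the sizes and the random joint distribution are ours.
import itertools
import numpy as np

np.random.seed(0)
nx, k, m = 6, 3, 2                # |X|, number of classes, quantizer cells
joint = np.random.rand(nx, k)
joint /= joint.sum()              # joint[x, c] = P(X = x, Y = c)

def opt_01_risk(q):
    """Optimal 0-1 risk over classifiers gamma for a fixed quantizer q: X -> Z.

    The best gamma predicts, in each cell z, the class maximizing
    P(Y = c, q(X) = z), so the risk is 1 - sum_z max_c P(Y = c, q(X) = z).
    """
    risk = 1.0
    for z in range(m):
        cell = joint[[x for x in range(nx) if q[x] == z]]
        if len(cell):
            risk -= cell.sum(axis=0).max()
    return risk

# Enumerate all quantizers and order them by their optimal 0-1 risk.
quantizers = list(itertools.product(range(m), repeat=nx))
risks = sorted((opt_01_risk(q), q) for q in quantizers)
print("best quantizer:", risks[0][1], "risk:", round(risks[0][0], 4))
```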

By doing so, we are able to identify all convex surrogate losses for which not only is the classification problem easy to solve, but which also give us the same optimal classifier $\gamma$ and quantizer $q$ as the 0-1 loss, regardless of the underlying distribution. For the binary case, this problem can be written in terms of the corresponding $f$-divergences and is studied in [NWJ09]. The main goal of this report is to define a generalized version of $f$-divergences, study its properties, and prove a general theorem characterizing universally equivalent losses in the multi-class setting in terms of generalized $f$-divergences. We will also show that the 0-1 loss is equivalent to a multi-dimensional hinge loss that is widely used in practice.

2 Generalized f-divergences

Let $P_1, \ldots, P_k$ be probability distributions on a common $\sigma$-algebra $\mathcal{F}$ over a set $\mathcal{X}$. Let $f : \mathbb{R}_+^{k-1} \to \mathbb{R}$ be a convex function satisfying $f(\mathbf{1}) = 0$. Let $\mu$ be any measure such that $P_i \ll \mu$ for all $i$, and let $p_i = dP_i/d\mu$ denote the density of $P_i$ with respect to $\mu$. Then we define the generalized $f$-divergence between $P_1, \ldots, P_k$ as

$$D_f(P_1, \ldots, P_{k-1} \,\|\, P_k) := \int f\Big(\frac{p_1(x)}{p_k(x)}, \ldots, \frac{p_{k-1}(x)}{p_k(x)}\Big) \, p_k(x) \, d\mu(x). \qquad (1)$$

Lemma 2.1. The value of the generalized $f$-divergence, defined in (1), does not depend on the base measure $\mu$.

We now illustrate a few properties of these multi-way $f$-divergences, showing how they extend classical divergences. Given a measurable mapping $q : \mathcal{X} \to \mathbb{N}$ with a finite range, we define

$$D_f(P_1, \ldots, P_{k-1} \,\|\, P_k \mid q) = \sum_{A \in q^{-1}(\mathbb{N})} f\Big(\frac{P_1(A)}{P_k(A)}, \ldots, \frac{P_{k-1}(A)}{P_k(A)}\Big) \, P_k(A). \qquad (2)$$

This is the quantized version of the generalized $f$-divergence. When the function $f$ is well behaved, by choosing the quantizer wisely we will not lose any information.

Lemma 2.2. Let $f$ be a closed convex function. Then

$$D_f(P_1, \ldots, P_{k-1} \,\|\, P_k) = \sup_q D_f(P_1, \ldots, P_{k-1} \,\|\, P_k \mid q),$$

where the supremum is over quantizers of $\mathcal{X}$.

For proving one of the most important properties of $f$-divergences, i.e., the data processing inequality, we first need the following lemma:

Lemma 2.3. Let $f : \mathbb{R}_+^{k-1} \to \mathbb{R}$ be convex. Let $a : \mathcal{X} \to \mathbb{R}_+$ and $b_i : \mathcal{X} \to \mathbb{R}_+$, for $1 \le i \le k-1$, be nonnegative measurable functions. Then for any finite measure $\mu$ defined on $\mathcal{X}$, we have

$$\Big(\int a \, d\mu\Big) f\Big(\frac{\int b_1 \, d\mu}{\int a \, d\mu}, \ldots, \frac{\int b_{k-1} \, d\mu}{\int a \, d\mu}\Big) \le \int a(x) \, f\Big(\frac{b_1(x)}{a(x)}, \ldots, \frac{b_{k-1}(x)}{a(x)}\Big) \, d\mu(x).$$
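For intuition, the following numerical sketch evaluates (1) and (2) on a finite space, where the integral becomes a sum under counting measure. The particular generator $f(t) = \sum_i t_i \log t_i$, for which $D_f$ becomes $\sum_{i < k} D_{\mathrm{kl}}(P_i \,\|\, P_k)$, is our illustrative choice; the inequality printed at the end is the easy direction of Lemma 2.2 (quantization can only lose information).

```python
# A small numerical check of (1) and (2); the finite space, the pmfs, and
# the choice f(t) = sum_i t_i log t_i are ours, purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
k, nx = 3, 8
P = rng.random((k, nx))
P /= P.sum(axis=1, keepdims=True)        # row i is the pmf of P_{i+1} on X

def f(t):                                 # f(t) = sum_i t_i log t_i, f(1) = 0
    t = np.asarray(t, dtype=float)
    return float(np.sum(t * np.log(t)))

def D_f(P):                               # eq. (1) with mu = counting measure
    return float(sum(P[-1, x] * f(P[:-1, x] / P[-1, x]) for x in range(P.shape[1])))

def D_f_quantized(P, q, m):               # eq. (2): pool mass over cells q^{-1}(z)
    Q = np.stack([np.bincount(q, weights=P[i], minlength=m) for i in range(len(P))])
    return D_f(Q)

q = np.arange(nx) % 3                     # a fixed quantizer X -> {1, 2, 3}
print(D_f_quantized(P, q, 3), "<=", D_f(P))
```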

Proposition 2.4. Let $f : \mathbb{R}_+^{k-1} \to \mathbb{R}$ be a convex function such that $f(\mathbf{1}) = 0$. Let $Q$ be a Markov kernel from $\mathcal{X}$ to $\mathcal{Z}$, that is, $Q(\cdot \mid x)$ is a probability distribution on $\mathcal{Z}$ for each $x \in \mathcal{X}$. Define the marginals $Q_{P_i}(A) := \int_{\mathcal{X}} Q(A \mid x) \, dP_i(x)$ for each $i = 1, \ldots, k$. Then

$$D_f\big(Q_{P_1}, \ldots, Q_{P_{k-1}} \,\|\, Q_{P_k}\big) \le D_f(P_1, \ldots, P_{k-1} \,\|\, P_k).$$

Proof. By Lemma 2.2, we have that

$$D_f\big(Q_{P_1}, \ldots, Q_{P_{k-1}} \,\|\, Q_{P_k}\big) = \sup\big\{ D_f\big(Q_{P_1}, \ldots, Q_{P_{k-1}} \,\|\, Q_{P_k} \mid q\big) : q \text{ is a quantizer of } \mathcal{Z} \big\}.$$

It is consequently no loss of generality to assume that $\mathcal{Z}$ is finite and $\mathcal{Z} = \{1, 2, \ldots, m\}$. Let $\mu = \sum_{c=1}^k P_c$ be a dominating measure, let $p_c = dP_c/d\mu$, and write $q_i(x) = Q(\{i\} \mid x)$. Then letting $q_{P_c}(i) = Q_{P_c}(\{i\})$, we obtain

$$D_f\big(Q_{P_1}, \ldots, Q_{P_{k-1}} \,\|\, Q_{P_k}\big) = \sum_{i=1}^m q_{P_k}(i) \, f\Big(\frac{q_{P_1}(i)}{q_{P_k}(i)}, \ldots, \frac{q_{P_{k-1}}(i)}{q_{P_k}(i)}\Big)$$

$$= \sum_{i=1}^m \Big(\int_{\mathcal{X}} q_i(x) p_k(x) \, d\mu(x)\Big) \, f\Big(\frac{\int_{\mathcal{X}} q_i(x) p_1(x) \, d\mu(x)}{\int_{\mathcal{X}} q_i(x) p_k(x) \, d\mu(x)}, \ldots, \frac{\int_{\mathcal{X}} q_i(x) p_{k-1}(x) \, d\mu(x)}{\int_{\mathcal{X}} q_i(x) p_k(x) \, d\mu(x)}\Big)$$

$$\le \sum_{i=1}^m \int_{\mathcal{X}} q_i(x) \, f\Big(\frac{p_1(x)}{p_k(x)}, \ldots, \frac{p_{k-1}(x)}{p_k(x)}\Big) p_k(x) \, d\mu(x)$$

by Lemma 2.3 (applied with $a = q_i p_k$ and $b_j = q_i p_j$). Noting that $\sum_{i=1}^m q_i(x) = 1$, we thus obtain our desired result. $\square$

3 Correspondence between risks and f-divergences

In order to state the correspondence between risks and $f$-divergences, let us introduce some notation. Denote by $\pi_c = \mathbb{P}(Y = c)$ the prior distribution on $\mathcal{Y}$ and by $p_c(x) = p(x \mid Y = c)$ the posterior density, for $1 \le c \le k$. Each quantization method $q$ induces a conditional probability distribution $Q_c(z) = \mathbb{P}(Y = c \mid q(X) = z)$ on $\mathcal{Z}$, for almost all $z$. Let $\Phi_c(\alpha)$ be the loss associated with $Y = c$, where $\alpha \in \mathbb{R}^k$, subject to the condition $\mathbf{1}^T \alpha = 0$, is the decision vector. For each discriminant function $\gamma : \mathcal{Z} \to \mathbb{R}^k$, the risk, defined as the expectation of the loss, equals

$$R_\Phi(\gamma \mid q) = \mathbb{E}[\Phi_Y(\gamma(q(X)))] = \sum_{i=1}^m \mathbb{E}[\Phi_Y(\gamma(i)) \mid q(X) = i] \, \mathbb{P}(q(X) = i)$$

$$= \sum_{i=1}^m \Big[ \sum_{c=1}^k \Phi_c(\gamma(i)) \, \mathbb{P}(Y = c \mid q(X) = i) \Big] \mathbb{P}(q(X) = i) = \sum_{i=1}^m \sum_{c=1}^k \pi_c \, \Phi_c(\gamma(i)) \, P_c(A_i),$$

where $A_i = \{x \in \mathcal{X} : q(x) = i\} = q^{-1}(i)$ and $P_c(A_i) = \mathbb{P}(q(X) = i \mid Y = c)$.
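Before taking the infimum over $\gamma$, here is a quick Monte Carlo sanity check (our own, with arbitrary small choices of $\Phi$, $q$, $\gamma$, and the distributions) that the surrogate risk $\mathbb{E}[\Phi_Y(\gamma(q(X)))]$ matches the cell decomposition just derived.

```python
# Monte Carlo check of the risk decomposition; all names and the choice of a
# multiclass logistic surrogate are ours, for illustration only.
import numpy as np

rng = np.random.default_rng(7)
k, nx, m, n = 3, 6, 2, 50_000
pi = np.array([0.2, 0.5, 0.3])                       # priors pi_c
P = rng.random((k, nx)); P /= P.sum(axis=1, keepdims=True)   # class-conditionals P_c
q = np.arange(nx) % m                                # a fixed quantizer
gamma = rng.normal(size=(m, k))
gamma -= gamma.mean(axis=1, keepdims=True)           # enforce 1^T gamma(i) = 0
Phi = lambda c, a: float(np.log(np.exp(a - a[c]).sum()))  # a convex surrogate

# Monte Carlo estimate of E[Phi_Y(gamma(q(X)))].
Y = rng.choice(k, size=n, p=pi)
X = np.array([rng.choice(nx, p=P[y]) for y in Y])
mc = np.mean([Phi(y, gamma[q[x]]) for x, y in zip(X, Y)])

# Cell decomposition: sum_i sum_c pi_c Phi_c(gamma(i)) P_c(A_i).
PA = np.stack([np.bincount(q, weights=P[c], minlength=m) for c in range(k)])
exact = sum(pi[c] * Phi(c, gamma[i]) * PA[c, i] for i in range(m) for c in range(k))
print(mc, "~", exact)
```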

By taking the infimum over $\gamma$, we get the quantized posterior Bayes $\Phi$-risk:

$$R_\Phi(q) = \inf_\gamma R_\Phi(\gamma \mid q) = \sum_{i=1}^m \inf_{\alpha^T \mathbf{1} = 0} \sum_{c=1}^k \pi_c \, \Phi_c(\alpha) \, P_c(A_i).$$

Now we have the following proposition.

Proposition 3.1. For any loss function $\Phi = (\Phi_1, \Phi_2, \ldots, \Phi_k)$ and any quantization rule $q : \mathcal{X} \to \{1, 2, \ldots, m\}$, define the quantized $\Phi$-statistical information as the gap between the prior Bayes $\Phi$-risk and the posterior Bayes $\Phi$-risk. Then the $\Phi$-statistical information can be written as a generalized $f$-divergence; that is,

$$I_\Phi(q) = R_\Phi - R_\Phi(q) = D_{f_\Phi}(P_1, P_2, \ldots, P_{k-1} \,\|\, P_k \mid q)$$

for some closed convex function $f_\Phi$. Further, $f_\Phi$ is given by

$$f_\Phi(t_1, t_2, \ldots, t_{k-1}) = \sup_{\mathbf{1}^T \alpha = 0} \Big\{ R_\Phi - \pi_k \Phi_k(\alpha) - \sum_{c=1}^{k-1} \pi_c \Phi_c(\alpha) \, t_c \Big\}. \qquad (3)$$

Proof. The proof is easy. Expanding the $\Phi$-statistical information and using $\sum_{i=1}^m P_k(A_i) = 1$,

$$I_\Phi(q) = R_\Phi - R_\Phi(q) = R_\Phi - \sum_{i=1}^m \inf_{\mathbf{1}^T\alpha = 0} \sum_{c=1}^k \pi_c \Phi_c(\alpha) P_c(A_i)$$

$$= \sum_{i=1}^m P_k(A_i) \sup_{\mathbf{1}^T\alpha = 0} \Big\{ R_\Phi - \pi_k \Phi_k(\alpha) - \sum_{c=1}^{k-1} \pi_c \Phi_c(\alpha) \frac{P_c(A_i)}{P_k(A_i)} \Big\} = D_{f_\Phi}(P_1, P_2, \ldots, P_{k-1} \,\|\, P_k \mid q).$$

It is easy to check that the choice (3) satisfies the above equality, and hence we are done. $\square$
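The identity in Proposition 3.1 is easy to test numerically. The sketch below is our construction: it assumes the multiclass hinge loss that Section 5 introduces, uses scipy's SLSQP for the constrained infima, and computes both sides of the identity on a random finite instance; they agree up to optimizer tolerance.

```python
# Numerical sanity check of Proposition 3.1 on a finite space (our setup).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
k, nx, m = 3, 8, 2
pi = np.array([0.5, 0.3, 0.2])                     # priors pi_c = P(Y = c)
P = rng.random((k, nx)); P /= P.sum(axis=1, keepdims=True)
q = np.arange(nx) % m                              # quantizer X -> {1,...,m}

Phi = lambda c, a: sum(max(1 + a[j], 0.0) for j in range(k) if j != c)

def inf_over_alpha(w):                             # min_{1^T a = 0} sum_c w_c Phi_c(a)
    res = minimize(lambda a: sum(w[c] * Phi(c, a) for c in range(k)),
                   np.zeros(k), method="SLSQP",
                   constraints={"type": "eq", "fun": np.sum})
    return res.fun

R_prior = inf_over_alpha(pi)                       # prior Bayes Phi-risk
PA = np.stack([np.bincount(q, weights=P[c], minlength=m) for c in range(k)])
R_post = sum(inf_over_alpha(pi * PA[:, i]) for i in range(m))
f_Phi = lambda t: R_prior - inf_over_alpha(pi * np.append(t, 1.0))   # eq. (3)
D = sum(PA[-1, i] * f_Phi(PA[:-1, i] / PA[-1, i]) for i in range(m))
print(R_prior - R_post, "~", D)                    # I_Phi(q) vs D_{f_Phi}( | q)
```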

4 Universal equivalence class characterization

As discussed in Section 1, it is interesting to study whether there exist two surrogate loss functions that induce the same ordering on the optimal risk of quantizers. More precisely:

Definition. Two surrogate loss functions $\Phi_1$ and $\Phi_2$ are universally equivalent if for any probability distribution $\mathbb{P}(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$, and any two quantization rules $q_1$ and $q_2$ on $\mathcal{X}$, there holds

$$R_{\Phi_1}(q_1) \le R_{\Phi_1}(q_2) \iff R_{\Phi_2}(q_1) \le R_{\Phi_2}(q_2).$$

It is easy to translate this definition in terms of $f_1$ and $f_2$, the $f$-divergences associated with $\Phi_1$ and $\Phi_2$, respectively; this is true according to Proposition 3.1.

Remark. Let $f_1$ and $f_2$ be the $f$-divergences associated with $\Phi_1$ and $\Phi_2$, respectively. Then $\Phi_1$ and $\Phi_2$ are universally equivalent if for any probability distribution $\mathbb{P}(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$, and any two quantization rules $q_1$ and $q_2$ on $\mathcal{X}$, we have

$$D_{f_1}(P_1, \ldots, P_{k-1} \,\|\, P_k \mid q_1) \le D_{f_1}(P_1, \ldots, P_{k-1} \,\|\, P_k \mid q_2) \iff D_{f_2}(P_1, \ldots, P_{k-1} \,\|\, P_k \mid q_1) \le D_{f_2}(P_1, \ldots, P_{k-1} \,\|\, P_k \mid q_2),$$

where for $1 \le c \le k$, $P_c$ is the conditional distribution of $X$ given $Y = c$.

Observe that this definition is very stringent, in that it requires that the ordering between the optimal $\Phi_1$- and $\Phi_2$-risks holds for all probability distributions $\mathbb{P}$ on $\mathcal{X} \times \mathcal{Y}$. The following theorem provides necessary and sufficient conditions under which universal equivalence holds.

Theorem 4.1. Let $f_1, f_2 : \mathbb{R}_+^{K-1} \to \mathbb{R}$ be the closed convex $f$-divergence generators associated with $\Phi_1$ and $\Phi_2$, respectively. Then $\Phi_1$ and $\Phi_2$ are universally equivalent if and only if

$$f_1(t) = a f_2(t) + b^T t + c$$

for some constants $a > 0$, $b \in \mathbb{R}^{K-1}$, $c \in \mathbb{R}$.

Due to space constraints, the proof of the theorem is left to the appendix. This theorem is the main result of the paper, and we now discuss how it can help us solve the minimization of the 0-1 loss. In the next section we compute the $f$-divergences associated with the 0-1 and hinge losses to confirm their equivalence.

5 Universal equivalence of 0-1 and hinge loss

Example (0-1 loss). First consider the 0-1 loss, which is defined as the probability of misclassification: the loss is zero if we choose the correct label and is 1 otherwise. This loss can be expressed in many ways as $\Phi^{0\text{-}1}(\alpha) = (\Phi_1^{0\text{-}1}(\alpha), \ldots, \Phi_k^{0\text{-}1}(\alpha))$; one such choice is

$$\Phi_c^{0\text{-}1}(\alpha) = 1 - I\big(\alpha_c > \alpha_j \text{ for } 1 \le j \le c-1 \text{ and } \alpha_c \ge \alpha_j \text{ for } c+1 \le j \le k\big).$$

This definition ensures that exactly one of the $\Phi_c^{0\text{-}1}$, $1 \le c \le k$, is zero and the others are one. The prior Bayes risk is

$$R_{\Phi^{0\text{-}1}} = \inf_{\mathbf{1}^T\alpha = 0} \sum_{c=1}^k \pi_c \, \Phi_c^{0\text{-}1}(\alpha) = 1 - \max_{1 \le c \le k} \pi_c,$$

while the posterior Bayes risk in the presence of the quantizer $q$ is

$$R_{\Phi^{0\text{-}1}}(q) = \sum_{i=1}^m \Big[ \sum_{c=1}^k \pi_c P_c(A_i) - \max_c \pi_c P_c(A_i) \Big] = 1 - \sum_{i=1}^m \max_c \pi_c P_c(A_i).$$

Thus, the $f$-divergence associated with the 0-1 loss is

$$f_{\Phi^{0\text{-}1}}(t) = \max\Big\{ \max_{1 \le i \le k-1} t_i \pi_i, \; \pi_k \Big\} - \max_{1 \le i \le k} \pi_i. \qquad (4)$$
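As a check on (4), the statistical information of the 0-1 loss, $I(q) = \sum_i \max_c \pi_c P_c(A_i) - \max_c \pi_c$, should coincide exactly with the divergence that (4) generates; the sketch below (our setup: a random finite instance) confirms this.

```python
# Numerical check that the 0-1 statistical information equals the divergence
# generated by (4); the finite instance is ours, for illustration.
import numpy as np

rng = np.random.default_rng(4)
k, nx, m = 3, 10, 4
pi = np.array([0.4, 0.35, 0.25])
P = rng.random((k, nx)); P /= P.sum(axis=1, keepdims=True)
q = np.arange(nx) % m
PA = np.stack([np.bincount(q, weights=P[c], minlength=m) for c in range(k)])

# Statistical information of the 0-1 loss: R - R(q).
I_q = (pi[:, None] * PA).max(axis=0).sum() - pi.max()

# Divergence generated by (4): f(t) = max{max_i t_i pi_i, pi_k} - max_i pi_i.
f01 = lambda t: max(np.max(t * pi[:-1]), pi[-1]) - pi.max()
D_q = sum(PA[-1, i] * f01(PA[:-1, i] / PA[-1, i]) for i in range(m))
print(I_q, "==", D_q)    # equal up to floating-point rounding
```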

Example (hinge loss). Denote by $\varphi^{\mathrm{hinge}}$ the hinge loss, defined by $\varphi^{\mathrm{hinge}}(\alpha) = (1 - \alpha)_+ = \max\{1 - \alpha, 0\}$. There are many ways of extending this loss function to the multi-dimensional setting. One general model for this purpose, introduced in [LLW04], is

$$\Phi_c(\alpha) = \sum_{j \neq c} \varphi(-\alpha_j), \qquad (5)$$

for $1 \le c \le k$. We also impose the constraint $\alpha^T \mathbf{1} = 0$ in (5). This definition was shown in [Zha04, TB07] to be consistent whenever $\varphi$ is convex and decreasing with $\varphi'(0) < 0$. We here also adopt this model and, for $1 \le c \le k$, define $\Phi_c^{\mathrm{hinge}}(\alpha) = \sum_{j \neq c} \varphi^{\mathrm{hinge}}(-\alpha_j) = \sum_{j \neq c} (1 + \alpha_j)_+$. Computing the prior Bayes $\Phi^{\mathrm{hinge}}$-risk, we get

$$R_{\Phi^{\mathrm{hinge}}} = \inf_{\alpha^T\mathbf{1} = 0} \sum_{c=1}^k \pi_c \sum_{j \neq c} (1 + \alpha_j)_+ = \inf_{\alpha^T\mathbf{1} = 0} \sum_{c=1}^k (1 - \pi_c)(1 + \alpha_c)_+. \qquad (6)$$

This convex optimization problem can be solved using the KKT conditions. Letting $M = \arg\max_{1 \le c \le k} \pi_c$, define $\alpha^*$ by

$$\alpha_i^* = \begin{cases} k - 1 & \text{if } i = M, \\ -1 & \text{if } i \neq M. \end{cases}$$

It is easy to check that $\alpha^*$ is a minimizer of the infimum in (6). Therefore,

$$R_{\Phi^{\mathrm{hinge}}} = k\big(1 - \max_{1 \le c \le k} \pi_c\big).$$

The posterior risk can also be calculated as

$$R_{\Phi^{\mathrm{hinge}}}(q) = k\Big(1 - \sum_{i=1}^m \max_{1 \le c \le k} \pi_c P_c(A_i)\Big).$$

Hence, the $f$-divergence corresponding to the hinge loss is

$$f_{\Phi^{\mathrm{hinge}}}(t) = k\Big( \max\Big\{ \max_{1 \le i \le k-1} t_i \pi_i, \; \pi_k \Big\} - \max_{1 \le i \le k} \pi_i \Big) = k \, f_{\Phi^{0\text{-}1}}(t),$$

where the last equality comes from (4). According to Theorem 4.1, we observe that the 0-1 and the hinge loss are universally equivalent. More generally, using (4) we can state the following corollary of Theorem 4.1:

Corollary 5.1. All loss functions $\Phi$ that are universally equivalent to the 0-1 loss induce $f$-divergences of the form

$$f_\Phi(t) = a \Big( \max\Big\{ \max_{1 \le i \le k-1} t_i \pi_i, \; \pi_k \Big\} - \max_{1 \le i \le k} \pi_i \Big) + b^T t + c$$

for some constants $a > 0$, $b \in \mathbb{R}^{k-1}$, $c \in \mathbb{R}$.
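The closed form (6) and the minimizer $\alpha^*$ are easy to confirm by direct numerical minimization; the brief sketch below (our check, using scipy's SLSQP) does so.

```python
# Confirming R_{Phi^hinge} = k (1 - max_c pi_c) and the minimizer alpha*.
import numpy as np
from scipy.optimize import minimize

k = 4
pi = np.array([0.1, 0.45, 0.25, 0.2])
Phi = lambda c, a: sum(max(1 + a[j], 0.0) for j in range(k) if j != c)
risk = lambda a: sum(pi[c] * Phi(c, a) for c in range(k))

res = minimize(risk, np.zeros(k), method="SLSQP",
               constraints={"type": "eq", "fun": np.sum})
alpha_star = np.where(np.arange(k) == pi.argmax(), k - 1.0, -1.0)
print(res.fun, "==", risk(alpha_star), "==", k * (1 - pi.max()))
```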

Appendix A

Proof of Lemma 2.1 Let $\mu_1$ and $\mu_2$ be dominating measures; then $\mu = \mu_1 + \mu_2$ also dominates $P_1, \ldots, P_k$ as well as $\mu_1$ and $\mu_2$. We have for $\nu = \mu_1$ or $\nu = \mu_2$ that

$$\frac{dP_i/d\nu}{dP_k/d\nu} = \frac{(dP_i/d\nu)(d\nu/d\mu)}{(dP_k/d\nu)(d\nu/d\mu)} = \frac{dP_i/d\mu}{dP_k/d\mu} \quad \text{and} \quad \frac{dP_i}{d\nu}\frac{d\nu}{d\mu} = \frac{dP_i}{d\mu},$$

the latter two equalities holding $\mu$-almost surely by definition of the Radon-Nikodym derivative. Thus we obtain for $\nu = \mu_1$ or $\nu = \mu_2$ that

$$\int f\Big(\frac{dP_1/d\nu}{dP_k/d\nu}, \ldots, \frac{dP_{k-1}/d\nu}{dP_k/d\nu}\Big) \frac{dP_k}{d\nu} \, d\nu = \int f\Big(\frac{dP_1/d\nu}{dP_k/d\nu}, \ldots, \frac{dP_{k-1}/d\nu}{dP_k/d\nu}\Big) \frac{dP_k}{d\nu} \frac{d\nu}{d\mu} \, d\mu = \int f\Big(\frac{dP_1/d\mu}{dP_k/d\mu}, \ldots, \frac{dP_{k-1}/d\mu}{dP_k/d\mu}\Big) \frac{dP_k}{d\mu} \, d\mu$$

by definition of the Radon-Nikodym derivative. Moreover, we see that

$$\frac{p_i}{p_k} = \frac{dP_i/d\mu}{dP_k/d\mu} \quad \text{a.s.-}\mu,$$

which shows that the base measure $\mu$ does not affect the integral. $\square$

For the proof of Lemma 2.2, we need some preliminary definitions and lemmas. Given a measurable partition $\mathcal{P}$ of $\mathcal{X}$, define

$$D_f(P_1, \ldots, P_{k-1} \,\|\, P_k \mid \mathcal{P}) = \sum_{A \in \mathcal{P}} f\Big(\frac{P_1(A)}{P_k(A)}, \ldots, \frac{P_{k-1}(A)}{P_k(A)}\Big) P_k(A).$$

In addition, given a sub-$\sigma$-algebra $\mathcal{G} \subset \mathcal{F}$, we let $P^{\mathcal{G}}$ denote the restriction of the measure $P$, defined on $\mathcal{F}$, to $\mathcal{G}$. Now we have the following proposition.

Proposition A.1. Let $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots$ be a sequence of sub-$\sigma$-algebras of $\mathcal{F}$ and let $\mathcal{F}_\infty = \sigma(\cup_{n \ge 1} \mathcal{F}_n)$. Then

$$D_f\big(P_1^{\mathcal{F}_1}, \ldots, P_{k-1}^{\mathcal{F}_1} \,\|\, P_k^{\mathcal{F}_1}\big) \le D_f\big(P_1^{\mathcal{F}_2}, \ldots, P_{k-1}^{\mathcal{F}_2} \,\|\, P_k^{\mathcal{F}_2}\big) \le \cdots \le D_f\big(P_1^{\mathcal{F}_\infty}, \ldots, P_{k-1}^{\mathcal{F}_\infty} \,\|\, P_k^{\mathcal{F}_\infty}\big),$$

and moreover,

$$\lim_{n \to \infty} D_f\big(P_1^{\mathcal{F}_n}, \ldots, P_{k-1}^{\mathcal{F}_n} \,\|\, P_k^{\mathcal{F}_n}\big) = D_f\big(P_1^{\mathcal{F}_\infty}, \ldots, P_{k-1}^{\mathcal{F}_\infty} \,\|\, P_k^{\mathcal{F}_\infty}\big).$$

Proof. Define the measure $\nu = \frac{1}{k}\sum_{i=1}^k P_i$ and vectors $V_n$ via the Radon-Nikodym derivatives

$$V_n = \frac{1}{k}\Big( \frac{dP_1^{\mathcal{F}_n}}{d\nu^{\mathcal{F}_n}}, \frac{dP_2^{\mathcal{F}_n}}{d\nu^{\mathcal{F}_n}}, \ldots, \frac{dP_{k-1}^{\mathcal{F}_n}}{d\nu^{\mathcal{F}_n}} \Big).$$

Then $1 - \mathbf{1}^T V_n = \frac{1}{k} \, dP_k^{\mathcal{F}_n}/d\nu^{\mathcal{F}_n}$, and $V_n$ is a martingale sequence adapted to the filtration $\mathcal{F}_n$ by standard properties of conditional expectation. Letting the set $C_k = \{v \in \mathbb{R}_+^{k-1} : \mathbf{1}^T v \le 1\}$, we define $g : C_k \to \mathbb{R}$ by

$$g(v) = (1 - \mathbf{1}^T v) \, f\Big( \frac{v}{1 - \mathbf{1}^T v} \Big).$$

Then $g$ is convex (it is a perspective function), and we have

$$\mathbb{E}_\nu[g(V_n)] = \int f\Big( \frac{dP_1^{\mathcal{F}_n}}{dP_k^{\mathcal{F}_n}}, \ldots, \frac{dP_{k-1}^{\mathcal{F}_n}}{dP_k^{\mathcal{F}_n}} \Big) \frac{1}{k} \frac{dP_k^{\mathcal{F}_n}}{d\nu^{\mathcal{F}_n}} \, d\nu = \frac{1}{k} D_f\big(P_1^{\mathcal{F}_n}, \ldots, P_{k-1}^{\mathcal{F}_n} \,\|\, P_k^{\mathcal{F}_n}\big).$$

Because $V_n \in C_k$ for all $n$, we see that $g(V_n)$ is a submartingale. This gives the first result of the proposition, that is, that the sequence $D_f(P_1^{\mathcal{F}_n}, \ldots, P_{k-1}^{\mathcal{F}_n} \,\|\, P_k^{\mathcal{F}_n})$ is non-decreasing in $n$.

Now, assume that the limit in the second statement is finite, as otherwise the result is trivial. Then, using that $f(\mathbf{1}) = 0$, we have by convexity that for any $v \in C_k$,

$$g(v) = (1 - \mathbf{1}^T v) \, f\Big(\frac{v}{1 - \mathbf{1}^T v}\Big) + f(\mathbf{1}) \, \mathbf{1}^T v \ge f\big( v + (\mathbf{1}^T v)\mathbf{1} \big) \ge \inf_{v \in C_k} f\big( v + (\mathbf{1}^T v)\mathbf{1} \big) > -\infty,$$

the final inequality a consequence of the fact that $f$ is closed and hence attains its infimum. In particular, the sequence $g(V_n) - \inf_{v \in C_k} g(v)$ is a non-negative submartingale, and thus

$$\sup_n \mathbb{E}_\nu\Big[ g(V_n) - \inf_{v \in C_k} g(v) \Big] = \lim_n \mathbb{E}_\nu[g(V_n)] - \inf_{v \in C_k} g(v) < \infty.$$

Coupled with this integrability guarantee, Doob's second martingale convergence theorem thus yields the existence of a random vector $V_\infty \in \mathcal{F}_\infty$ such that

$$0 \le \lim_n \mathbb{E}_\nu\Big[ g(V_n) - \inf_{v \in C_k} g(v) \Big] = \mathbb{E}_\nu[g(V_\infty)] - \inf_{v \in C_k} g(v) < \infty.$$

Because $\inf_{v \in C_k} g(v) > -\infty$, we have $\mathbb{E}_\nu[g(V_\infty)] < \infty$, giving the result of the proposition. $\square$

Proof of Lemma 2.2 Let the base measure $\mu = \frac{1}{k}\sum_{i=1}^k P_i$ and let $p_i = dP_i/d\mu$ be the associated densities of the $P_i$. Define the increasing sequence of partitions $\mathcal{P}_n$ of $\mathcal{X}$ by the sets $A_\alpha^{(n)}$, for vectors $\alpha = (\alpha_1, \ldots, \alpha_{k-1})$, and $B^{(n)}$, with

$$A_\alpha^{(n)} = \Big\{ x \in \mathcal{X} : \frac{\alpha_j - 1}{2^n} \le \frac{p_j(x)}{p_k(x)} < \frac{\alpha_j}{2^n} \text{ for } j = 1, \ldots, k-1 \Big\},$$

where we let each $\alpha_j$ range over $\{-n2^n, -n2^n + 1, \ldots, n2^n\}$, and define $B^{(n)} = (\cup_\alpha A_\alpha^{(n)})^c = \mathcal{X} \setminus \cup_\alpha A_\alpha^{(n)}$. Then we have

$$\Big( \frac{p_1(x)}{p_k(x)}, \ldots, \frac{p_{k-1}(x)}{p_k(x)} \Big) = \lim_{n \to \infty} \Big[ \sum_{\alpha \in \{-n2^n, \ldots, n2^n\}^{k-1}} \frac{\alpha}{2^n} \, \chi_{A_\alpha^{(n)}}(x) + (n, \ldots, n) \, \chi_{B^{(n)}}(x) \Big],$$

each term of which on the right-hand side is $\mathcal{F}_n$-measurable, where $\mathcal{F}_n$ denotes the sub-$\sigma$-field generated by the partition $\mathcal{P}_n$. Defining $\mathcal{F}_\infty = \sigma(\cup_{n \ge 1} \mathcal{F}_n)$, we have

$$D_f(P_1, \ldots, P_{k-1} \,\|\, P_k \mid \mathcal{P}_n) = D_f\big(P_1^{\mathcal{F}_n}, \ldots, P_{k-1}^{\mathcal{F}_n} \,\|\, P_k^{\mathcal{F}_n}\big) \longrightarrow D_f\big(P_1^{\mathcal{F}_\infty}, \ldots, P_{k-1}^{\mathcal{F}_\infty} \,\|\, P_k^{\mathcal{F}_\infty}\big) = D_f(P_1, \ldots, P_{k-1} \,\|\, P_k),$$

where the limiting operation follows by Proposition A.1 and the final equality holds because the vector of likelihood ratios $(p_1/p_k, \ldots, p_{k-1}/p_k)$ is $\mathcal{F}_\infty$-measurable. $\square$

Proof of Lemma 2.3 Recall that the perspective function $g(y, t)$ defined by $g(y, t) = t f(y/t)$ for $t > 0$ is jointly convex in $y$ and $t$. The measure $\nu = \mu/\mu(\mathcal{X})$ defines a probability measure, so that for $X \sim \nu$, Jensen's inequality implies

$$\Big(\int a \, d\nu\Big) f\Big( \frac{\int b_1 \, d\nu}{\int a \, d\nu}, \ldots, \frac{\int b_{k-1} \, d\nu}{\int a \, d\nu} \Big) \le \mathbb{E}_\nu\Big[ a(X) f\Big( \frac{b_1(X)}{a(X)}, \ldots, \frac{b_{k-1}(X)}{a(X)} \Big) \Big] = \frac{1}{\mu(\mathcal{X})} \int a(x) f\Big( \frac{b_1(x)}{a(x)}, \ldots, \frac{b_{k-1}(x)}{a(x)} \Big) d\mu(x).$$

Noting that $\int b_i \, d\nu / \int a \, d\nu = \int b_i \, d\mu / \int a \, d\mu$ gives the result. $\square$

For proving the main theorem we need the following definition.

Definition. Let $f_1 : \mathbb{R}_+^{K-1} \to \mathbb{R}$ and $f_2 : \mathbb{R}_+^{K-1} \to \mathbb{R}$ be continuous convex functions. Let $m \in \mathbb{N}$ be arbitrary and let the nonnegative sequences $a_{ij} \ge 0$, $b_{ij} \ge 0$, $i = 1, \ldots, K-1$ and $j = 1, \ldots, m$, satisfy $\sum_{j=1}^m a_{ij} = \sum_{j=1}^m b_{ij}$ for each $i = 1, \ldots, K-1$. Then we say that $f_1$ and $f_2$ are order-equivalent if for all $m \in \mathbb{N}$ and all such sequences $a_{ij}$ and $b_{ij}$,

$$\sum_{j=1}^m f_1(a_{1j}, a_{2j}, \ldots, a_{K-1,j}) \le \sum_{j=1}^m f_1(b_{1j}, b_{2j}, \ldots, b_{K-1,j})$$

if and only if

$$\sum_{j=1}^m f_2(a_{1j}, a_{2j}, \ldots, a_{K-1,j}) \le \sum_{j=1}^m f_2(b_{1j}, b_{2j}, \ldots, b_{K-1,j}).$$

Lemma A.2. If the nonnegative sequences $(a_j)$ and $(b_j)$, $1 \le j \le M$, satisfy $\sum_{j=1}^M a_j = \sum_{j=1}^M b_j$, then there exist nonnegative numbers $z_{ij}$, $1 \le i, j \le M$, such that

$$\sum_{i=1}^M z_{ij} = a_j \quad \text{and} \quad \sum_{j=1}^M z_{ij} = b_i \qquad \text{for } 1 \le i, j \le M.$$

Proof of Lemma A.2 The proof is based on induction on $M$. When $M = 1$, the claim is immediate. Now suppose the statement is true for $M - 1$; we argue that it also holds for $M$. Without loss of generality, we may assume that $a_M = \min\{a_j : 1 \le j \le M\}$ and $b_M = \min\{b_j : 1 \le j \le M\}$ (otherwise, reorder both sequences), and that $a_M \le b_M$ (the case $b_M \le a_M$ is symmetric). We pick $z_{M,M} = a_M$ and $z_{i,M} = 0$ for all $1 \le i \le M-1$, so that column $M$ sums to $a_M$. Now let $j^*$ be such that $a_{j^*} = \max\{a_j : 1 \le j \le M-1\}$; then $a_{j^*} \ge b_M - a_M$, since $a_{j^*} \ge \frac{1}{M}\sum_{k=1}^M a_k = \frac{1}{M}\sum_{k=1}^M b_k \ge b_M$. Set $z_{M,j^*} = b_M - a_M$ and $z_{M,j} = 0$ for all $1 \le j \le M-1$ with $j \neq j^*$, so that row $M$ sums to $b_M$. Now, if we denote $\tilde{a}_j = a_j - z_{M,j}$ and $\tilde{b}_j = b_j - z_{j,M}$ for $1 \le j \le M-1$, then we have $\sum_{j=1}^{M-1} \tilde{a}_j = \sum_{j=1}^{M-1} \tilde{b}_j$. By the induction hypothesis, there exist nonnegative numbers $z_{ij}$, $1 \le i, j \le M-1$, such that $\sum_{i=1}^{M-1} z_{ij} = \tilde{a}_j$ and $\sum_{j=1}^{M-1} z_{ij} = \tilde{b}_i$ for $1 \le i, j \le M-1$. It is easy to check that the $z_{ij}$ we constructed satisfy the requirement $\sum_{i=1}^M z_{ij} = a_j$ and $\sum_{j=1}^M z_{ij} = b_i$ for $1 \le i, j \le M$. $\square$
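Lemma A.2 is a small transportation-style fact: a nonnegative matrix with prescribed row and column sums exists whenever the totals match. A standard constructive alternative to the induction above is the northwest-corner rule from transportation problems, sketched here (our implementation, for illustration only).

```python
# Northwest-corner construction of a matrix with given row/column sums.
import numpy as np

def coupling(a, b):
    """Nonnegative z with column sums a and row sums b (needs sum a == sum b)."""
    a, b = list(map(float, a)), list(map(float, b))
    M = len(a)
    z = np.zeros((M, M))
    i = j = 0
    while i < M and j < M:
        z[i, j] = min(b[i], a[j])     # ship as much as possible into cell (i, j)
        b[i] -= z[i, j]; a[j] -= z[i, j]
        if b[i] <= 1e-12: i += 1      # row i exhausted
        else: j += 1                  # column j exhausted
    return z

a = [3.0, 1.0, 2.0]; b = [2.0, 2.0, 2.0]
z = coupling(a, b)
print(z.sum(axis=0), z.sum(axis=1))   # recovers a (columns) and b (rows)
```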

Lemma A.3. If two loss functions $\Phi_1$ and $\Phi_2$ are universally equivalent, then their corresponding $f$-divergences, denoted $f_1$ and $f_2$, are order-equivalent.

Proof of Lemma A.3 Given nonnegative sequences $a_{ij} \ge 0$, $b_{ij} \ge 0$, $i = 1, \ldots, K-1$ and $j = 1, \ldots, m$, such that $\sum_{j=1}^m a_{ij} = \sum_{j=1}^m b_{ij}$ for each $i = 1, \ldots, K-1$, we may find some positive integer $M \ge m$ such that $M > \sum_{j=1}^m a_{ij}$ for all $i = 1, \ldots, K-1$. Denote $a_{i,m+1} = M - \sum_{j=1}^m a_{ij}$ and $b_{i,m+1} = M - \sum_{j=1}^m b_{ij}$ for $i = 1, \ldots, K-1$, set $a_{i,l} = b_{i,l} = 0$ for all $m+1 < l \le M$, and extend the sequences by $a_{K,l} = b_{K,l} = 1$ for all $1 \le l \le M$. Hence, we have $a_{i,l} \ge 0$, $b_{i,l} \ge 0$ for all $1 \le i \le K$, $1 \le l \le M$, and $\sum_{l=1}^M a_{i,l} = \sum_{l=1}^M b_{i,l} = M$.

Now, take $\mathcal{X} = \{(i, j) : 1 \le i \le M, 1 \le j \le M\}$ and $\mathcal{Z} = \{1, 2, \ldots, M\}$. Denote by $q_1$ the quantizer that maps $(i, j) \in \mathcal{X}$ to $i \in \mathcal{Z}$ and by $q_2$ the quantizer that maps $(i, j) \in \mathcal{X}$ to $j \in \mathcal{Z}$. Now, Lemma A.2 shows that for each $1 \le c \le K$ there exist $z_{ij}^c \ge 0$, $1 \le i, j \le M$, such that $\sum_{i=1}^M z_{ij}^c = a_{cj}/M$ and $\sum_{j=1}^M z_{ij}^c = b_{ci}/M$. Denote by $P_c$ the probability measure with $P_c(\{(i, j)\}) = z_{ij}^c$ for $1 \le i, j \le M$, $1 \le c \le K$. Under this design, $P_c(q_2^{-1}(j)) = a_{cj}/M$ and $P_c(q_1^{-1}(i)) = b_{ci}/M$; in particular, $P_K$ assigns mass $1/M$ to every cell of either quantizer. Now, under this quantizer design, we have by the universal equivalence of $\Phi_1$ and $\Phi_2$ that

$$\frac{1}{M} \sum_{j=1}^M f_1(a_{1j}, \ldots, a_{K-1,j}) \le \frac{1}{M} \sum_{j=1}^M f_1(b_{1j}, \ldots, b_{K-1,j})$$

if and only if

$$\frac{1}{M} \sum_{j=1}^M f_2(a_{1j}, \ldots, a_{K-1,j}) \le \frac{1}{M} \sum_{j=1}^M f_2(b_{1j}, \ldots, b_{K-1,j}),$$

where $f_1$ and $f_2$ are the associated $f$-divergences. Note that $a_{ij} = b_{ij}$ for all $1 \le i \le K-1$ and $j > m$, so the terms with $j > m$ are identical on both sides and cancel. Therefore, we have

$$\sum_{j=1}^m f_1(a_{1j}, \ldots, a_{K-1,j}) \le \sum_{j=1}^m f_1(b_{1j}, \ldots, b_{K-1,j}) \iff \sum_{j=1}^m f_2(a_{1j}, \ldots, a_{K-1,j}) \le \sum_{j=1}^m f_2(b_{1j}, \ldots, b_{K-1,j}).$$

Since this holds for all nonnegative sequences $a_{ij} \ge 0$, $b_{ij} \ge 0$, $i = 1, \ldots, K-1$ and $j = 1, \ldots, m$, such that $\sum_{j=1}^m a_{ij} = \sum_{j=1}^m b_{ij}$ for each $i = 1, \ldots, K-1$, the functions $f_1$ and $f_2$ are thus order-equivalent. $\square$

We give one technical lemma that will be useful throughout our derivation.

Lemma A.4 (Karamata's majorization inequality). Let $f : \mathbb{R} \to \mathbb{R}$ be a convex function. If $x_1 \ge x_2 \ge \cdots \ge x_n$, $y_1 \ge y_2 \ge \cdots \ge y_n$, $x_1 + \cdots + x_k \ge y_1 + \cdots + y_k$ for $k \in \{1, \ldots, n-1\}$, and $x_1 + \cdots + x_n = y_1 + \cdots + y_n$, then

$$f(x_1) + \cdots + f(x_n) \ge f(y_1) + \cdots + f(y_n).$$
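A quick randomized check of Lemma A.4 (our test harness; any convex $f$ and any majorizing pair will do):

```python
# Randomized sanity check of Karamata's majorization inequality.
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.exp(x)                      # any convex function works here
for _ in range(1000):
    y = np.sort(rng.normal(size=6))[::-1]
    # Build x majorizing y: same total sum, larger partial sums.
    x = y.copy(); x[0] += 0.5; x[-1] -= 0.5
    assert np.all(np.cumsum(x)[:-1] >= np.cumsum(y)[:-1])
    assert np.isclose(x.sum(), y.sum())
    assert f(x).sum() >= f(y).sum() - 1e-12  # Karamata
print("Karamata held on all random instances")
```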

Lemma A.5. Let $f_1 : \mathbb{R}_+ \to \mathbb{R}$ and $f_2 : \mathbb{R}_+ \to \mathbb{R}$ be order-equivalent. If furthermore $f_1(i) = f_2(i)$ for $i = 0, 1, 2$, then we have $f_1(x) = f_2(x)$ for all $x \in [0, 2]$.

Proof. Let $S = \{x \ge 0 : f_1(x) = f_2(x)\}$ denote the set of points where $f_1$ and $f_2$ agree. The lemma will follow once we show the following three claims:

(a) First, we show that if $0 \in S$, $t \in S$, and $2t \in S$ for some $t > 0$, then $t/2 \in S$.

(b) If $0 \in S$, $t \in S$, and $2t \in S$, then $3t/2 \in S$.

(c) Based on these two observations, an inductive argument guarantees that $[0, 2] \subset S$.

We begin with claim (a). For $q \in [0, 1]$ and $i \in \{1, 2\}$, define the functions

$$\varphi_i(q) = q f_i(2t) + 2 f_i(t/2) - (2q + 1) f_i(t) - (1 - q) f_i(0).$$

We claim that

$$\operatorname{sign} \varphi_1(q) = \operatorname{sign} \varphi_2(q) \quad \text{for all } q \in [0, 1]. \qquad (7)$$

Indeed, let $q = r/s$ for some $r, s \in \mathbb{N}$ with $s \ge r$. Then

$$s \varphi_i(q) = r f_i(2t) + 2s f_i(t/2) - (2r + s) f_i(t) - (s - r) f_i(0).$$

Hence, we have $\varphi_i(q) \ge 0$ if and only if

$$\underbrace{f_i(2t) + \cdots + f_i(2t)}_{r \text{ times}} + \underbrace{f_i(t/2) + \cdots + f_i(t/2)}_{2s \text{ times}} \ge \underbrace{f_i(t) + \cdots + f_i(t)}_{2r+s \text{ times}} + \underbrace{f_i(0) + \cdots + f_i(0)}_{s-r \text{ times}}.$$

Using that the $f_i$ are order-equivalent, we set $m = 2s + r$ and note that $r \cdot 2t + 2s \cdot (t/2) = (2r + s) \cdot t + (s - r) \cdot 0$. Using the definition of order equivalence, we reach the conclusion (7) for the rationals $\mathbb{Q}$, that is, that $\operatorname{sign}\varphi_1(q) = \operatorname{sign}\varphi_2(q)$ for $q \in \mathbb{Q} \cap [0, 1]$. To extend the result to $[0, 1]$, we use the continuity of the $\varphi_i$ and the fact that $\operatorname{cl}(\mathbb{Q} \cap [0, 1]) = [0, 1]$, which yields the desired result (7).

By the convexity of the $f_i$, Karamata's majorization inequality (Lemma A.4) implies that

$$f_i(2t) + 2 f_i(t/2) \ge 3 f_i(t) \quad \text{and} \quad 2 f_i(t/2) \le f_i(t) + f_i(0).$$

That is, we have

$$\varphi_i(1) = f_i(2t) + 2 f_i(t/2) - 3 f_i(t) \ge 0 \quad \text{and} \quad \varphi_i(0) = 2 f_i(t/2) - f_i(t) - f_i(0) \le 0.$$

Using the intermediate value theorem, there exists $r \in [0, 1]$ such that $\varphi_1(r) = 0$, and as $\operatorname{sign}\varphi_1 = \operatorname{sign}\varphi_2$, we have $\varphi_1(r) = \varphi_2(r) = 0$. Rewriting this using the definition of $\varphi_1, \varphi_2$, we have

$$0 = \varphi_i(r) = r f_i(2t) + 2 f_i(t/2) - (2r + 1) f_i(t) - (1 - r) f_i(0),$$

which we rewrite as

$$f_i(t/2) = \frac{1}{2}\big[ (2r + 1) f_i(t) + (1 - r) f_i(0) - r f_i(2t) \big].$$

As $f_1(x) = f_2(x)$ for $x \in \{0, t, 2t\}$, the above equation guarantees that $f_1(t/2) = f_2(t/2)$, completing our proof of claim (a).

A similar technique yields our claim (b), that is, that if $0, t, 2t \in S$, then $3t/2 \in S$. Consider now the functions $\varphi_i : [0, 1] \to \mathbb{R}$ defined by

$$\varphi_i(q) = f_i(3t/2) + \frac{q}{2} f_i(0) - \frac{2q + 1}{2} f_i(t) - \frac{1 - q}{2} f_i(2t).$$

As in our argument for claim (a), we have $\operatorname{sign}\varphi_1 = \operatorname{sign}\varphi_2$ on $\mathbb{Q} \cap [0, 1]$: if we assume $q = r/s$ for some $r, s \in \mathbb{N}$ with $s \ge r$, then $\varphi_i(q) \ge 0$ if and only if

$$\underbrace{f_i(3t/2) + \cdots + f_i(3t/2)}_{2s \text{ times}} + \underbrace{f_i(0) + \cdots + f_i(0)}_{r \text{ times}} \ge \underbrace{f_i(t) + \cdots + f_i(t)}_{2r+s \text{ times}} + \underbrace{f_i(2t) + \cdots + f_i(2t)}_{s-r \text{ times}}.$$

Using the order-equivalence of $f_1$ and $f_2$ by taking $m = 2s + r$ (note that $2s \cdot (3t/2) + r \cdot 0 = (2r + s) \cdot t + (s - r) \cdot 2t$), we see that $\operatorname{sign}\varphi_1(q) = \operatorname{sign}\varphi_2(q)$ for $q \in \mathbb{Q} \cap [0, 1]$. The obvious extension to $[0, 1]$ follows, so that $\operatorname{sign}\varphi_1 = \operatorname{sign}\varphi_2$ on $[0, 1]$. Now, note that by convexity of the $f_i$, Karamata's majorization inequality (Lemma A.4) implies

$$\varphi_i(0) = \frac{1}{2}\big( 2 f_i(3t/2) - f_i(t) - f_i(2t) \big) \le 0 \quad \text{and} \quad \varphi_i(1) = \frac{1}{2}\big( 2 f_i(3t/2) + f_i(0) - 3 f_i(t) \big) \ge 0.$$

The intermediate value theorem implies there exists $r \in [0, 1]$ such that $\varphi_i(r) = 0$ for $i = 1, 2$ (as $\operatorname{sign}\varphi_1 = \operatorname{sign}\varphi_2$), whence

$$\varphi_i(r) = f_i(3t/2) + \frac{r}{2} f_i(0) - \frac{2r + 1}{2} f_i(t) - \frac{1 - r}{2} f_i(2t) = 0.$$

This immediately implies

$$f_i(3t/2) = \frac{2r + 1}{2} f_i(t) + \frac{1 - r}{2} f_i(2t) - \frac{r}{2} f_i(0).$$

As $f_1(x) = f_2(x)$ for $x = 0, t, 2t$, the above formula implies $f_1(x) = f_2(x)$ also at $x = 3t/2$, yielding our claim (b).

The final part of our argument is to note that claims (a) and (b) can be generalized to any three numbers $\{s, s + t, s + 2t\} \subset [0, 2]$: that is, if $\{s, s + t, s + 2t\} \subset S$, then $s + \frac{1}{2} t$ and $s + \frac{3}{2} t$ belong to $S$. This follows when we define the shifted functions $\tilde{f}_i(x) = f_i(s + x)$, noting that $\tilde{f}_1$ and $\tilde{f}_2$ are order-equivalent (by the definition of order equivalence). Applying claims (a) and (b) at $(0, t, 2t)$ to the $\tilde{f}_i$, we see that $\tilde{f}_1(t/2) = \tilde{f}_2(t/2)$ and $\tilde{f}_1(3t/2) = \tilde{f}_2(3t/2)$. An inductive argument immediately implies that all the dyadic numbers, those rationals of the form $q = m/2^n$ with $m = 0, \ldots, 2^{n+1}$, belong to the set $S$. As $f_1$ and $f_2$ are continuous, the set $S = \{x \ge 0 : f_1(x) = f_2(x)\}$ is closed, and by the denseness of the dyadic numbers in $[0, 2]$, it must be the case that $S \supset [0, 2]$. $\square$

One may use the argument shown above to obtain the following corollary.

Corollary A.6. Let $f_1 : \mathbb{R}_+ \to \mathbb{R}$ and $f_2 : \mathbb{R}_+ \to \mathbb{R}$ be order-equivalent. If furthermore $f_1(a + ib) = f_2(a + ib)$ for $i = 0, 1, 2$, then we have $f_1(x) = f_2(x)$ for all $x \in [a, a + 2b]$.

Lemma A.7. Let $f_1, f_2 : \mathbb{R}_+ \to \mathbb{R}$ be order-equivalent. If $f_1(i) = f_2(i)$ for $i = 0, 1, 2$, and furthermore $2 f_1(1) \neq f_1(2) + f_1(0)$, then $f_1(t) = f_2(t)$ for all $t \in \mathbb{R}_+$.

Proof. Convexity of $f_1$ together with the given condition implies $2 f_1(1) - f_1(2) - f_1(0) < 0$. Since $f_1$ and $f_2$ are order-equivalent, we obtain $2 f_2(1) - f_2(2) - f_2(0) < 0$. Let $n \ge 3$ be an arbitrary integer and consider the functions $\varphi_i : [0, 1] \to \mathbb{R}$ for $i \in \{1, 2\}$, defined by

$$\varphi_i(q) = q f_i(n) + (nq - 1) f_i(0) - \big( (n + 2) q - 2 \big) f_i(1) - (1 - q) f_i(2).$$

We claim that (7) still holds. Indeed, first for rational numbers $q = r/s$, where $r, s \in \mathbb{N}$ with $s \ge r$, we have

$$s \varphi_i(q) = r f_i(n) + (nr - s) f_i(0) - \big( (n + 2) r - 2s \big) f_i(1) - (s - r) f_i(2).$$

Hence, we have $\varphi_i(q) \ge 0$ if and only if

$$\underbrace{f_i(n) + \cdots + f_i(n)}_{r \text{ times}} + \underbrace{f_i(0) + \cdots + f_i(0)}_{nr - s \text{ times}} \ge \underbrace{f_i(1) + \cdots + f_i(1)}_{(n+2)r - 2s \text{ times}} + \underbrace{f_i(2) + \cdots + f_i(2)}_{s - r \text{ times}}.$$

If $nr - s < 0$ we can move the term containing $f_i(0)$ to the right-hand side; likewise, if $(n+2)r - 2s < 0$ we can move the term containing $f_i(1)$ to the left-hand side. We set $m = r + (nr - s) = \big( (n+2)r - 2s \big) + (s - r)$ and note that $n \cdot r + 0 \cdot (nr - s) = 1 \cdot \big( (n+2)r - 2s \big) + 2 \cdot (s - r)$. Since $f_1$ and $f_2$ are order-equivalent, $\varphi_1$ and $\varphi_2$ satisfy (7) for $q \in \mathbb{Q} \cap [0, 1]$. Using the continuity of $\varphi_1$ and $\varphi_2$ we can extend this result to all numbers in $[0, 1]$, which proves the claim.

By convexity of the $f_i$, Karamata's majorization inequality (Lemma A.4) implies that

$$\varphi_i(1) = f_i(n) + (n - 1) f_i(0) - n f_i(1) \ge 0,$$

while by assumption

$$\varphi_i(0) = 2 f_i(1) - f_i(2) - f_i(0) < 0.$$

Using the intermediate value theorem, there exists $u \in (0, 1]$ such that $\varphi_1(u) = 0$, and since $\operatorname{sign}\varphi_1 = \operatorname{sign}\varphi_2$ we obtain $\varphi_2(u) = 0$. We can rewrite this as

$$\varphi_i(u) = u f_i(n) + (nu - 1) f_i(0) - \big( (n+2)u - 2 \big) f_i(1) - (1 - u) f_i(2) = 0,$$

which can also be written as

$$f_i(n) = \frac{1}{u} \Big[ \big( (n+2)u - 2 \big) f_i(1) + (1 - u) f_i(2) - (nu - 1) f_i(0) \Big].$$

This shows that $f_1(n) = f_2(n)$. Since $n \ge 3$ was chosen arbitrarily, we obtain $f_1(n) = f_2(n)$ for all integers $n \ge 0$. For completing the proof we only need to use Corollary A.6: for any $t \in \mathbb{R}_+$, pick integers $a$ and $b$ with $a < t < a + 2b$. Since $f_1(a + ib) = f_2(a + ib)$ for $i \in \{0, 1, 2\}$, we get $f_1(t) = f_2(t)$ according to Corollary A.6. $\square$

Now we have the tools required for proving the main result. Here we only show the difficult direction in the case $K = 2$, where $f_1$ and $f_2$ are functions of a single variable.

Proof of Theorem 4.1 One direction is easy to prove. If $f_1(t) = a f_2(t) + b^T t + c$, then by computing the $f_1$- and $f_2$-divergences we get

$$D_{f_1}(P_1, P_2, \ldots, P_{k-1} \,\|\, P_k \mid q) = a \, D_{f_2}(P_1, P_2, \ldots, P_{k-1} \,\|\, P_k \mid q) + b^T \mathbf{1} + c,$$

since the affine part contributes $\sum_i \big( b^T \frac{(P_1(A_i), \ldots, P_{k-1}(A_i))}{P_k(A_i)} + c \big) P_k(A_i) = \sum_j b_j + c$, a constant independent of $q$. Since $a > 0$, this means that $\Phi_1$ and $\Phi_2$ induce the same ordering on quantizers, as required.

For proving the other direction, assuming $K = 2$, Lemma A.3 shows that $f_1$ and $f_2$ are order-equivalent. Denote the sets $T_i = \{t \in \mathbb{R}_+ : f_i(2t) + f_i(0) \neq 2 f_i(t)\}$ for $i \in \{1, 2\}$. We have two different cases:

(a) $T_1 \neq \emptyset$. Pick any $t_0 \in T_1$. Note that $t_0 \in T_2$, according to the order-equivalence of $f_1$ and $f_2$. Define $\tilde{f}_i(t) = f_i(t \, t_0)$ for $i \in \{1, 2\}$. It is easy to check that $\tilde{f}_1$ and $\tilde{f}_2$ are order-equivalent. Choose numbers $a$, $b$, and $c$ as

$$a = \frac{\tilde{f}_1(2) - 2\tilde{f}_1(1) + \tilde{f}_1(0)}{\tilde{f}_2(2) - 2\tilde{f}_2(1) + \tilde{f}_2(0)}, \qquad b = \tilde{f}_1(0) - a \tilde{f}_2(0), \qquad c = \tilde{f}_1(1) - a \tilde{f}_2(1) - b.$$

(Note that $a > 0$, as both second differences are strictly positive by convexity and the fact that $t_0 \in T_1 \cap T_2$.)

Construct a new function $\hat{f}(t) = a \tilde{f}_2(t) + ct + b$. According to the choice of $a$, $b$, and $c$, we have $\tilde{f}_1(t) = \hat{f}(t)$ for $t \in \{0, 1, 2\}$. Furthermore, $2 \tilde{f}_1(1) \neq \tilde{f}_1(2) + \tilde{f}_1(0)$, since $t_0 \in T_1$. Lemma A.7 thus implies that $\tilde{f}_1(t) = \hat{f}(t)$ for all $t \in \mathbb{R}_+$; that is, $f_1(t \, t_0) = a f_2(t \, t_0) + ct + b$ for all $t \in \mathbb{R}_+$. Since $t_0 > 0$ is fixed, we obtain $f_1(t) = a f_2(t) + (c/t_0) t + b$, which implies the result in this case.

(b) $T_1 = \emptyset$. Then for all $t \in \mathbb{R}_+$, $2 f_1(t) = f_1(2t) + f_1(0)$. This also implies that $2 f_2(t) = f_2(2t) + f_2(0)$ for all $t \in \mathbb{R}_+$, by the order-equivalence. Using the convexity of the $f_i$, we conclude that each $f_i$ is an affine function on $\mathbb{R}_+$. Therefore, there exist $b, c \in \mathbb{R}$ such that $f_1(t) = f_2(t) + bt + c$, as desired. $\square$

References

[LLW04] Yoonkyung Lee, Yi Lin, and Grace Wahba. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67-81, 2004.

[LV06] F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394-4412, 2006.

[NWJ09] XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. On surrogate loss functions and f-divergences. The Annals of Statistics, 37(2):876-904, 2009.

[TB07] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007-1025, 2007.

[Tsi93] John N. Tsitsiklis. Decentralized detection. In Advances in Statistical Signal Processing, Vol. 2, pages 297-344. JAI Press, 1993.

[Zha04] Tong Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225-1251, 2004.
