Convexity, Detection, and Generalized f-divergences


Khashayar Khosravi, Feng Ruan, John Duchi

June 5

1 Introduction

The goal of the classification problem is to learn a discriminant function for classifying forthcoming data. Given data arriving in i.i.d. pairs $(X_i, Y_i)$, $1 \le i \le n$, from some underlying distributions with $X \in \mathcal{X}$ and $Y \in \mathcal{Y} = \{1, 2, 3, \ldots, k\}$, we aim to design a classifier $\gamma$ with the smallest 0-1 loss, given by $\mathbb{P}(\gamma(X) \neq Y)$. However, minimizing the 0-1 loss directly is often computationally intractable, and practical algorithms are based on a convex relaxation of the 0-1 loss, say $\Phi$, which is called the surrogate loss. It is natural to ask whether solving the minimization problem over $\Phi$ leads to a minimizer of the 0-1 loss; this property is called consistency. In contrast to binary classification, the analysis of consistency of $\Phi$ in the multi-class classification problem requires more sophisticated approaches.

The prior Bayes $\Phi$-risk is defined as the optimal achievable risk given access only to the probabilities of the labels. Likewise, the posterior Bayes $\Phi$-risk is the optimal achievable risk once we additionally have access to the posterior densities, given by $p_c(x) = p(x \mid Y = c)$ for $1 \le c \le k$. The difference between the two is an important quantity, as it indicates the amount of information that $X$ carries. In the binary classification problem, this gap was shown in [LV06] to equal an $f$-divergence between the probability distributions $P_1$ and $P_2$, for a convex $f : \mathbb{R} \to \mathbb{R}$.

In many applications, the covariates used for solving the minimization problem are either received after passing through a dimension-reducing quantizer, or are deliberately transformed by a feature-selection stage in order to obtain a more meaningful interpretation. The decentralized detection problem, as described in [Tsi93], is an example of the first situation: we collect information from sensors located in different places and aim to solve a hypothesis testing problem at a fusion center. Owing to constraints on sensor power and on the communication channels, we must send quantized versions of the collected information to the fusion center. In experimental design problems we often encounter the second situation, since we discard information that is useless for our analysis.

For modeling purposes, let $q : \mathcal{X} \to \mathcal{Z} = \{1, 2, \ldots, m\}$ be a quantizer. Our aim is to jointly learn a classifier $\gamma : \mathcal{Z} \to \mathcal{Y}$ and a quantizer $q$ by minimizing the misclassification rate, given by $\mathbb{P}(\gamma(q(X)) \neq Y)$. Considering any convex surrogate loss $\Phi$, we can find the optimal risk of a quantizer $q$, denoted $R_\Phi(q)$, by empirically minimizing $\mathbb{E}[\Phi_Y(\gamma(q(X)))]$ over $\gamma$. One fundamental question is to find all surrogate losses $\Phi$ that are universally equivalent to the 0-1 loss, in the sense that, for all underlying probability distributions on $\mathcal{X} \times \mathcal{Y}$, the $\Phi$ loss and the 0-1 loss induce the same ordering on the optimal risk of quantizers.
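To make the ordering on quantizers concrete, the following minimal sketch (our construction, not part of the report) enumerates all quantizers on a small finite $\mathcal{X}$ and orders them by the optimal 0-1 risk of the best classifier $\gamma$ built on top of each; this is exactly the ordering that universal equivalence concerns.

```python
# A minimal sketch of the joint (classifier, quantizer) problem on a finite
# covariate space; the sizes and the random joint distribution are ours.
import itertools
import numpy as np

np.random.seed(0)
nx, k, m = 6, 3, 2                # |X|, number of classes, quantizer cells
joint = np.random.rand(nx, k)
joint /= joint.sum()              # joint[x, c] = P(X = x, Y = c)

def opt_01_risk(q):
    """Optimal 0-1 risk over classifiers gamma for a fixed quantizer q: X -> Z.

    The best gamma predicts, in each cell z, the class maximizing
    P(Y = c, q(X) = z), so the risk is 1 - sum_z max_c P(Y = c, q(X) = z).
    """
    risk = 1.0
    for z in range(m):
        cell = joint[[x for x in range(nx) if q[x] == z]]
        if len(cell):
            risk -= cell.sum(axis=0).max()
    return risk

# Enumerate all quantizers and order them by their optimal 0-1 risk.
quantizers = list(itertools.product(range(m), repeat=nx))
risks = sorted((opt_01_risk(q), q) for q in quantizers)
print("best quantizer:", risks[0][1], "risk:", round(risks[0][0], 4))
```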

By doing so, we are able to identify all convex surrogate losses for which not only is the classification problem easy to solve, but which also give us the same optimal classifier $\gamma$ and quantizer $q$ as the 0-1 loss, regardless of the underlying distribution. For the binary case, this problem can be written in terms of the corresponding $f$-divergences and is studied in [NWJ09]. The main goal of this report is to define a generalized version of $f$-divergences, study its properties, and prove a general theorem characterizing universally equivalent losses in the multi-class setting in terms of generalized $f$-divergences. We will also show that the 0-1 loss is equivalent to a multi-dimensional hinge loss that is widely used in practice.

2 Generalized f-divergences

Let $P_1, \ldots, P_k$ be probability distributions on a common $\sigma$-algebra $\mathcal{F}$ over a set $\mathcal{X}$. Let $f : \mathbb{R}_+^{k-1} \to \mathbb{R}$ be a convex function satisfying $f(\mathbf{1}) = 0$. Let $\mu$ be any measure such that $P_i \ll \mu$ for all $i$, and let $p_i = dP_i/d\mu$ denote the density of $P_i$ with respect to $\mu$. Then we define the generalized $f$-divergence between $P_1, \ldots, P_k$ as

$$D_f(P_1, \ldots, P_{k-1} \,\|\, P_k) := \int f\Big(\frac{p_1(x)}{p_k(x)}, \ldots, \frac{p_{k-1}(x)}{p_k(x)}\Big) \, p_k(x) \, d\mu(x). \qquad (1)$$

Lemma 2.1. The value of the generalized $f$-divergence, defined in (1), does not depend on the base measure $\mu$.

We now illustrate a few properties of these multi-way $f$-divergences, showing how they extend classical divergences. Given a measurable mapping $q : \mathcal{X} \to \mathbb{N}$ with a finite range, we define

$$D_f(P_1, \ldots, P_{k-1} \,\|\, P_k \mid q) = \sum_{A \in q^{-1}(\mathbb{N})} f\Big(\frac{P_1(A)}{P_k(A)}, \ldots, \frac{P_{k-1}(A)}{P_k(A)}\Big) \, P_k(A). \qquad (2)$$

This is the quantized version of the generalized $f$-divergence. When the function $f$ is well behaved, by choosing the quantizer wisely we will not lose any information.

Lemma 2.2. Let $f$ be a closed convex function. Then

$$D_f(P_1, \ldots, P_{k-1} \,\|\, P_k) = \sup_q D_f(P_1, \ldots, P_{k-1} \,\|\, P_k \mid q),$$

where the supremum is over quantizers of $\mathcal{X}$.

For proving one of the most important properties of $f$-divergences, i.e., the data processing inequality, we first need the following lemma:

Lemma 2.3. Let $f : \mathbb{R}_+^{k-1} \to \mathbb{R}$ be convex. Let $a : \mathcal{X} \to \mathbb{R}_+$ and $b_i : \mathcal{X} \to \mathbb{R}_+$, for $1 \le i \le k-1$, be nonnegative measurable functions. Then for any finite measure $\mu$ defined on $\mathcal{X}$, we have

$$\Big(\int a \, d\mu\Big) f\Big(\frac{\int b_1 \, d\mu}{\int a \, d\mu}, \ldots, \frac{\int b_{k-1} \, d\mu}{\int a \, d\mu}\Big) \le \int a(x) \, f\Big(\frac{b_1(x)}{a(x)}, \ldots, \frac{b_{k-1}(x)}{a(x)}\Big) \, d\mu(x).$$
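For intuition, the following numerical sketch evaluates (1) and (2) on a finite space, where the integral becomes a sum under counting measure. The particular generator $f(t) = \sum_i t_i \log t_i$, for which $D_f$ becomes $\sum_{i < k} D_{\mathrm{kl}}(P_i \,\|\, P_k)$, is our illustrative choice; the inequality printed at the end is the easy direction of Lemma 2.2 (quantization can only lose information).

```python
# A small numerical check of (1) and (2); the finite space, the pmfs, and
# the choice f(t) = sum_i t_i log t_i are ours, purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
k, nx = 3, 8
P = rng.random((k, nx))
P /= P.sum(axis=1, keepdims=True)        # row i is the pmf of P_{i+1} on X

def f(t):                                 # f(t) = sum_i t_i log t_i, f(1) = 0
    t = np.asarray(t, dtype=float)
    return float(np.sum(t * np.log(t)))

def D_f(P):                               # eq. (1) with mu = counting measure
    return float(sum(P[-1, x] * f(P[:-1, x] / P[-1, x]) for x in range(P.shape[1])))

def D_f_quantized(P, q, m):               # eq. (2): pool mass over cells q^{-1}(z)
    Q = np.stack([np.bincount(q, weights=P[i], minlength=m) for i in range(len(P))])
    return D_f(Q)

q = np.arange(nx) % 3                     # a fixed quantizer X -> {1, 2, 3}
print(D_f_quantized(P, q, 3), "<=", D_f(P))
```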

Proposition 2.4. Let $f : \mathbb{R}_+^{k-1} \to \mathbb{R}$ be a convex function such that $f(\mathbf{1}) = 0$. Let $Q$ be a Markov kernel from $\mathcal{X}$ to $\mathcal{Z}$, that is, $Q(\cdot \mid x)$ is a probability distribution on $\mathcal{Z}$ for each $x \in \mathcal{X}$. Define the marginals $Q_{P_i}(A) := \int_{\mathcal{X}} Q(A \mid x) \, dP_i(x)$ for each $i = 1, \ldots, k$. Then

$$D_f\big(Q_{P_1}, \ldots, Q_{P_{k-1}} \,\|\, Q_{P_k}\big) \le D_f(P_1, \ldots, P_{k-1} \,\|\, P_k).$$

Proof. By Lemma 2.2, we have that

$$D_f\big(Q_{P_1}, \ldots, Q_{P_{k-1}} \,\|\, Q_{P_k}\big) = \sup\big\{ D_f\big(Q_{P_1}, \ldots, Q_{P_{k-1}} \,\|\, Q_{P_k} \mid q\big) : q \text{ is a quantizer of } \mathcal{Z} \big\}.$$

It is consequently no loss of generality to assume that $\mathcal{Z}$ is finite and $\mathcal{Z} = \{1, 2, \ldots, m\}$. Let $\mu = \sum_{c=1}^k P_c$ be a dominating measure, let $p_c = dP_c/d\mu$, and write $q_i(x) = Q(\{i\} \mid x)$. Then letting $q_{P_c}(i) = Q_{P_c}(\{i\})$, we obtain

$$D_f\big(Q_{P_1}, \ldots, Q_{P_{k-1}} \,\|\, Q_{P_k}\big) = \sum_{i=1}^m q_{P_k}(i) \, f\Big(\frac{q_{P_1}(i)}{q_{P_k}(i)}, \ldots, \frac{q_{P_{k-1}}(i)}{q_{P_k}(i)}\Big)$$

$$= \sum_{i=1}^m \Big(\int_{\mathcal{X}} q_i(x) p_k(x) \, d\mu(x)\Big) \, f\Big(\frac{\int_{\mathcal{X}} q_i(x) p_1(x) \, d\mu(x)}{\int_{\mathcal{X}} q_i(x) p_k(x) \, d\mu(x)}, \ldots, \frac{\int_{\mathcal{X}} q_i(x) p_{k-1}(x) \, d\mu(x)}{\int_{\mathcal{X}} q_i(x) p_k(x) \, d\mu(x)}\Big)$$

$$\le \sum_{i=1}^m \int_{\mathcal{X}} q_i(x) \, f\Big(\frac{p_1(x)}{p_k(x)}, \ldots, \frac{p_{k-1}(x)}{p_k(x)}\Big) p_k(x) \, d\mu(x)$$

by Lemma 2.3 (applied with $a = q_i p_k$ and $b_j = q_i p_j$). Noting that $\sum_{i=1}^m q_i(x) = 1$, we thus obtain our desired result. $\square$

3 Correspondence between risks and f-divergences

In order to state the correspondence between risks and $f$-divergences, let us introduce some notation. Denote by $\pi_c = \mathbb{P}(Y = c)$ the prior distribution on $\mathcal{Y}$ and by $p_c(x) = p(x \mid Y = c)$ the posterior density, for $1 \le c \le k$. Each quantization method $q$ induces a conditional probability distribution $Q_c(z) = \mathbb{P}(Y = c \mid q(X) = z)$ on $\mathcal{Z}$, for almost all $z$. Let $\Phi_c(\alpha)$ be the loss associated with $Y = c$, where $\alpha \in \mathbb{R}^k$, subject to the condition $\mathbf{1}^T \alpha = 0$, is the decision vector. For each discriminant function $\gamma : \mathcal{Z} \to \mathbb{R}^k$, the risk, defined as the expectation of the loss, equals

$$R_\Phi(\gamma \mid q) = \mathbb{E}[\Phi_Y(\gamma(q(X)))] = \sum_{i=1}^m \mathbb{E}[\Phi_Y(\gamma(i)) \mid q(X) = i] \, \mathbb{P}(q(X) = i)$$

$$= \sum_{i=1}^m \Big[ \sum_{c=1}^k \Phi_c(\gamma(i)) \, \mathbb{P}(Y = c \mid q(X) = i) \Big] \mathbb{P}(q(X) = i) = \sum_{i=1}^m \sum_{c=1}^k \pi_c \, \Phi_c(\gamma(i)) \, P_c(A_i),$$

where $A_i = \{x \in \mathcal{X} : q(x) = i\} = q^{-1}(i)$ and $P_c(A_i) = \mathbb{P}(q(X) = i \mid Y = c)$.
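Before taking the infimum over $\gamma$, here is a quick Monte Carlo sanity check (our own, with arbitrary small choices of $\Phi$, $q$, $\gamma$, and the distributions) that the surrogate risk $\mathbb{E}[\Phi_Y(\gamma(q(X)))]$ matches the cell decomposition just derived.

```python
# Monte Carlo check of the risk decomposition; all names and the choice of a
# multiclass logistic surrogate are ours, for illustration only.
import numpy as np

rng = np.random.default_rng(7)
k, nx, m, n = 3, 6, 2, 50_000
pi = np.array([0.2, 0.5, 0.3])                       # priors pi_c
P = rng.random((k, nx)); P /= P.sum(axis=1, keepdims=True)   # class-conditionals P_c
q = np.arange(nx) % m                                # a fixed quantizer
gamma = rng.normal(size=(m, k))
gamma -= gamma.mean(axis=1, keepdims=True)           # enforce 1^T gamma(i) = 0
Phi = lambda c, a: float(np.log(np.exp(a - a[c]).sum()))  # a convex surrogate

# Monte Carlo estimate of E[Phi_Y(gamma(q(X)))].
Y = rng.choice(k, size=n, p=pi)
X = np.array([rng.choice(nx, p=P[y]) for y in Y])
mc = np.mean([Phi(y, gamma[q[x]]) for x, y in zip(X, Y)])

# Cell decomposition: sum_i sum_c pi_c Phi_c(gamma(i)) P_c(A_i).
PA = np.stack([np.bincount(q, weights=P[c], minlength=m) for c in range(k)])
exact = sum(pi[c] * Phi(c, gamma[i]) * PA[c, i] for i in range(m) for c in range(k))
print(mc, "~", exact)
```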

By taking the infimum over $\gamma$, we get the quantized posterior Bayes $\Phi$-risk:

$$R_\Phi(q) = \inf_\gamma R_\Phi(\gamma \mid q) = \sum_{i=1}^m \inf_{\alpha^T \mathbf{1} = 0} \sum_{c=1}^k \pi_c \, \Phi_c(\alpha) \, P_c(A_i).$$

Now we have the following proposition.

Proposition 3.1. For any loss function $\Phi = (\Phi_1, \Phi_2, \ldots, \Phi_k)$ and any quantization rule $q : \mathcal{X} \to \{1, 2, \ldots, m\}$, define the quantized $\Phi$-statistical information as the gap between the prior Bayes $\Phi$-risk and the posterior Bayes $\Phi$-risk. Then the $\Phi$-statistical information can be written as a generalized $f$-divergence; that is,

$$I_\Phi(q) = R_\Phi - R_\Phi(q) = D_{f_\Phi}(P_1, P_2, \ldots, P_{k-1} \,\|\, P_k \mid q)$$

for some closed convex function $f_\Phi$. Further, $f_\Phi$ is given by

$$f_\Phi(t_1, t_2, \ldots, t_{k-1}) = \sup_{\mathbf{1}^T \alpha = 0} \Big\{ R_\Phi - \pi_k \Phi_k(\alpha) - \sum_{c=1}^{k-1} \pi_c \Phi_c(\alpha) \, t_c \Big\}. \qquad (3)$$

Proof. The proof is easy. Expanding the $\Phi$-statistical information and using $\sum_{i=1}^m P_k(A_i) = 1$,

$$I_\Phi(q) = R_\Phi - R_\Phi(q) = R_\Phi - \sum_{i=1}^m \inf_{\mathbf{1}^T\alpha = 0} \sum_{c=1}^k \pi_c \Phi_c(\alpha) P_c(A_i)$$

$$= \sum_{i=1}^m P_k(A_i) \sup_{\mathbf{1}^T\alpha = 0} \Big\{ R_\Phi - \pi_k \Phi_k(\alpha) - \sum_{c=1}^{k-1} \pi_c \Phi_c(\alpha) \frac{P_c(A_i)}{P_k(A_i)} \Big\} = D_{f_\Phi}(P_1, P_2, \ldots, P_{k-1} \,\|\, P_k \mid q).$$

It is easy to check that the choice (3) satisfies the above equality, and hence we are done. $\square$
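The identity in Proposition 3.1 is easy to test numerically. The sketch below is our construction: it assumes the multiclass hinge loss that Section 5 introduces, uses scipy's SLSQP for the constrained infima, and computes both sides of the identity on a random finite instance; they agree up to optimizer tolerance.

```python
# Numerical sanity check of Proposition 3.1 on a finite space (our setup).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
k, nx, m = 3, 8, 2
pi = np.array([0.5, 0.3, 0.2])                     # priors pi_c = P(Y = c)
P = rng.random((k, nx)); P /= P.sum(axis=1, keepdims=True)
q = np.arange(nx) % m                              # quantizer X -> {1,...,m}

Phi = lambda c, a: sum(max(1 + a[j], 0.0) for j in range(k) if j != c)

def inf_over_alpha(w):                             # min_{1^T a = 0} sum_c w_c Phi_c(a)
    res = minimize(lambda a: sum(w[c] * Phi(c, a) for c in range(k)),
                   np.zeros(k), method="SLSQP",
                   constraints={"type": "eq", "fun": np.sum})
    return res.fun

R_prior = inf_over_alpha(pi)                       # prior Bayes Phi-risk
PA = np.stack([np.bincount(q, weights=P[c], minlength=m) for c in range(k)])
R_post = sum(inf_over_alpha(pi * PA[:, i]) for i in range(m))
f_Phi = lambda t: R_prior - inf_over_alpha(pi * np.append(t, 1.0))   # eq. (3)
D = sum(PA[-1, i] * f_Phi(PA[:-1, i] / PA[-1, i]) for i in range(m))
print(R_prior - R_post, "~", D)                    # I_Phi(q) vs D_{f_Phi}( | q)
```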

4 Universal equivalence class characterization

As discussed in Section 1, it is interesting to study whether there exist two surrogate loss functions that induce the same ordering on the optimal risk of quantizers. More precisely:

Definition. Two surrogate loss functions $\Phi_1$ and $\Phi_2$ are universally equivalent if for any probability distribution $\mathbb{P}(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$, and any two quantization rules $q_1$ and $q_2$ on $\mathcal{X}$, there holds

$$R_{\Phi_1}(q_1) \le R_{\Phi_1}(q_2) \iff R_{\Phi_2}(q_1) \le R_{\Phi_2}(q_2).$$

It is easy to translate this definition in terms of $f_1$ and $f_2$, the $f$-divergences associated with $\Phi_1$ and $\Phi_2$, respectively; this is true according to Proposition 3.1.

Remark. Let $f_1$ and $f_2$ be the $f$-divergences associated with $\Phi_1$ and $\Phi_2$, respectively. Then $\Phi_1$ and $\Phi_2$ are universally equivalent if for any probability distribution $\mathbb{P}(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$, and any two quantization rules $q_1$ and $q_2$ on $\mathcal{X}$, we have

$$D_{f_1}(P_1, \ldots, P_{k-1} \,\|\, P_k \mid q_1) \le D_{f_1}(P_1, \ldots, P_{k-1} \,\|\, P_k \mid q_2) \iff D_{f_2}(P_1, \ldots, P_{k-1} \,\|\, P_k \mid q_1) \le D_{f_2}(P_1, \ldots, P_{k-1} \,\|\, P_k \mid q_2),$$

where for $1 \le c \le k$, $P_c$ is the conditional distribution of $X$ given $Y = c$.

Observe that this definition is very stringent, in that it requires that the ordering between the optimal $\Phi_1$- and $\Phi_2$-risks holds for all probability distributions $\mathbb{P}$ on $\mathcal{X} \times \mathcal{Y}$. The following theorem provides necessary and sufficient conditions under which universal equivalence holds.

Theorem 4.1. Let $f_1, f_2 : \mathbb{R}_+^{K-1} \to \mathbb{R}$ be the closed convex $f$-divergence generators associated with $\Phi_1$ and $\Phi_2$, respectively. Then $\Phi_1$ and $\Phi_2$ are universally equivalent if and only if

$$f_1(t) = a f_2(t) + b^T t + c$$

for some constants $a > 0$, $b \in \mathbb{R}^{K-1}$, $c \in \mathbb{R}$.

Due to space constraints, the proof of the theorem is left to the appendix. This theorem is the main result of the paper, and we now discuss how it can help us solve the minimization of the 0-1 loss. In the next section we compute the $f$-divergences associated with the 0-1 and hinge losses to confirm their equivalence.

5 Universal equivalence of 0-1 and hinge loss

Example (0-1 loss). First consider the 0-1 loss, which is defined as the probability of misclassification: the loss is zero if we choose the correct label and is 1 otherwise. This loss can be expressed in many ways as $\Phi^{0\text{-}1}(\alpha) = (\Phi_1^{0\text{-}1}(\alpha), \ldots, \Phi_k^{0\text{-}1}(\alpha))$; one such choice is

$$\Phi_c^{0\text{-}1}(\alpha) = 1 - I\big(\alpha_c > \alpha_j \text{ for } 1 \le j \le c-1 \text{ and } \alpha_c \ge \alpha_j \text{ for } c+1 \le j \le k\big).$$

This definition ensures that exactly one of the $\Phi_c^{0\text{-}1}$, $1 \le c \le k$, is zero and the others are one. The prior Bayes risk is

$$R_{\Phi^{0\text{-}1}} = \inf_{\mathbf{1}^T\alpha = 0} \sum_{c=1}^k \pi_c \, \Phi_c^{0\text{-}1}(\alpha) = 1 - \max_{1 \le c \le k} \pi_c,$$

while the posterior Bayes risk in the presence of the quantizer $q$ is

$$R_{\Phi^{0\text{-}1}}(q) = \sum_{i=1}^m \Big[ \sum_{c=1}^k \pi_c P_c(A_i) - \max_c \pi_c P_c(A_i) \Big] = 1 - \sum_{i=1}^m \max_c \pi_c P_c(A_i).$$

Thus, the $f$-divergence associated with the 0-1 loss is

$$f_{\Phi^{0\text{-}1}}(t) = \max\Big\{ \max_{1 \le i \le k-1} t_i \pi_i, \; \pi_k \Big\} - \max_{1 \le i \le k} \pi_i. \qquad (4)$$
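As a check on (4), the statistical information of the 0-1 loss, $I(q) = \sum_i \max_c \pi_c P_c(A_i) - \max_c \pi_c$, should coincide exactly with the divergence that (4) generates; the sketch below (our setup: a random finite instance) confirms this.

```python
# Numerical check that the 0-1 statistical information equals the divergence
# generated by (4); the finite instance is ours, for illustration.
import numpy as np

rng = np.random.default_rng(4)
k, nx, m = 3, 10, 4
pi = np.array([0.4, 0.35, 0.25])
P = rng.random((k, nx)); P /= P.sum(axis=1, keepdims=True)
q = np.arange(nx) % m
PA = np.stack([np.bincount(q, weights=P[c], minlength=m) for c in range(k)])

# Statistical information of the 0-1 loss: R - R(q).
I_q = (pi[:, None] * PA).max(axis=0).sum() - pi.max()

# Divergence generated by (4): f(t) = max{max_i t_i pi_i, pi_k} - max_i pi_i.
f01 = lambda t: max(np.max(t * pi[:-1]), pi[-1]) - pi.max()
D_q = sum(PA[-1, i] * f01(PA[:-1, i] / PA[-1, i]) for i in range(m))
print(I_q, "==", D_q)    # equal up to floating-point rounding
```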

Example (hinge loss). Denote by $\varphi^{\mathrm{hinge}}$ the hinge loss, defined by $\varphi^{\mathrm{hinge}}(\alpha) = (1 - \alpha)_+ = \max\{1 - \alpha, 0\}$. There are many ways of extending this loss function to the multi-dimensional setting. One general model for this purpose, introduced in [LLW04], is

$$\Phi_c(\alpha) = \sum_{j \neq c} \varphi(-\alpha_j), \qquad (5)$$

for $1 \le c \le k$. We also impose the constraint $\alpha^T \mathbf{1} = 0$ in (5). This definition was shown in [Zha04, TB07] to be consistent whenever $\varphi$ is convex and decreasing with $\varphi'(0) < 0$. We here also adopt this model and, for $1 \le c \le k$, define $\Phi_c^{\mathrm{hinge}}(\alpha) = \sum_{j \neq c} \varphi^{\mathrm{hinge}}(-\alpha_j) = \sum_{j \neq c} (1 + \alpha_j)_+$. Computing the prior Bayes $\Phi^{\mathrm{hinge}}$-risk, we get

$$R_{\Phi^{\mathrm{hinge}}} = \inf_{\alpha^T\mathbf{1} = 0} \sum_{c=1}^k \pi_c \sum_{j \neq c} (1 + \alpha_j)_+ = \inf_{\alpha^T\mathbf{1} = 0} \sum_{c=1}^k (1 - \pi_c)(1 + \alpha_c)_+. \qquad (6)$$

This convex optimization problem can be solved using the KKT conditions. Letting $M = \arg\max_{1 \le c \le k} \pi_c$, define $\alpha^*$ by

$$\alpha_i^* = \begin{cases} k - 1 & \text{if } i = M, \\ -1 & \text{if } i \neq M. \end{cases}$$

It is easy to check that $\alpha^*$ is a minimizer of the infimum in (6). Therefore,

$$R_{\Phi^{\mathrm{hinge}}} = k\big(1 - \max_{1 \le c \le k} \pi_c\big).$$

The posterior risk can also be calculated as

$$R_{\Phi^{\mathrm{hinge}}}(q) = k\Big(1 - \sum_{i=1}^m \max_{1 \le c \le k} \pi_c P_c(A_i)\Big).$$

Hence, the $f$-divergence corresponding to the hinge loss is

$$f_{\Phi^{\mathrm{hinge}}}(t) = k\Big( \max\Big\{ \max_{1 \le i \le k-1} t_i \pi_i, \; \pi_k \Big\} - \max_{1 \le i \le k} \pi_i \Big) = k \, f_{\Phi^{0\text{-}1}}(t),$$

where the last equality comes from (4). According to Theorem 4.1, we observe that the 0-1 and the hinge loss are universally equivalent. More generally, using (4) we can state the following corollary of Theorem 4.1:

Corollary 5.1. All loss functions $\Phi$ that are universally equivalent to the 0-1 loss induce $f$-divergences of the form

$$f_\Phi(t) = a \Big( \max\Big\{ \max_{1 \le i \le k-1} t_i \pi_i, \; \pi_k \Big\} - \max_{1 \le i \le k} \pi_i \Big) + b^T t + c$$

for some constants $a > 0$, $b \in \mathbb{R}^{k-1}$, $c \in \mathbb{R}$.
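The closed form (6) and the minimizer $\alpha^*$ are easy to confirm by direct numerical minimization; the brief sketch below (our check, using scipy's SLSQP) does so.

```python
# Confirming R_{Phi^hinge} = k (1 - max_c pi_c) and the minimizer alpha*.
import numpy as np
from scipy.optimize import minimize

k = 4
pi = np.array([0.1, 0.45, 0.25, 0.2])
Phi = lambda c, a: sum(max(1 + a[j], 0.0) for j in range(k) if j != c)
risk = lambda a: sum(pi[c] * Phi(c, a) for c in range(k))

res = minimize(risk, np.zeros(k), method="SLSQP",
               constraints={"type": "eq", "fun": np.sum})
alpha_star = np.where(np.arange(k) == pi.argmax(), k - 1.0, -1.0)
print(res.fun, "==", risk(alpha_star), "==", k * (1 - pi.max()))
```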

Appendix A

Proof of Lemma 2.1 Let $\mu_1$ and $\mu_2$ be dominating measures; then $\mu = \mu_1 + \mu_2$ also dominates $P_1, \ldots, P_k$ as well as $\mu_1$ and $\mu_2$. We have for $\nu = \mu_1$ or $\nu = \mu_2$ that

$$\frac{dP_i/d\nu}{dP_k/d\nu} = \frac{(dP_i/d\nu)(d\nu/d\mu)}{(dP_k/d\nu)(d\nu/d\mu)} = \frac{dP_i/d\mu}{dP_k/d\mu} \quad \text{and} \quad \frac{dP_i}{d\nu}\frac{d\nu}{d\mu} = \frac{dP_i}{d\mu},$$

the latter two equalities holding $\mu$-almost surely by definition of the Radon-Nikodym derivative. Thus we obtain for $\nu = \mu_1$ or $\nu = \mu_2$ that

$$\int f\Big(\frac{dP_1/d\nu}{dP_k/d\nu}, \ldots, \frac{dP_{k-1}/d\nu}{dP_k/d\nu}\Big) \frac{dP_k}{d\nu} \, d\nu = \int f\Big(\frac{dP_1/d\nu}{dP_k/d\nu}, \ldots, \frac{dP_{k-1}/d\nu}{dP_k/d\nu}\Big) \frac{dP_k}{d\nu} \frac{d\nu}{d\mu} \, d\mu = \int f\Big(\frac{dP_1/d\mu}{dP_k/d\mu}, \ldots, \frac{dP_{k-1}/d\mu}{dP_k/d\mu}\Big) \frac{dP_k}{d\mu} \, d\mu$$

by definition of the Radon-Nikodym derivative. Moreover, we see that

$$\frac{p_i}{p_k} = \frac{dP_i/d\mu}{dP_k/d\mu} \quad \text{a.s.-}\mu,$$

which shows that the base measure $\mu$ does not affect the integral. $\square$

For the proof of Lemma 2.2, we need some preliminary definitions and lemmas. Given a measurable partition $\mathcal{P}$ of $\mathcal{X}$, define

$$D_f(P_1, \ldots, P_{k-1} \,\|\, P_k \mid \mathcal{P}) = \sum_{A \in \mathcal{P}} f\Big(\frac{P_1(A)}{P_k(A)}, \ldots, \frac{P_{k-1}(A)}{P_k(A)}\Big) P_k(A).$$

In addition, given a sub-$\sigma$-algebra $\mathcal{G} \subset \mathcal{F}$, we let $P^{\mathcal{G}}$ denote the restriction of the measure $P$, defined on $\mathcal{F}$, to $\mathcal{G}$. Now we have the following proposition.

Proposition A.1. Let $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots$ be a sequence of sub-$\sigma$-algebras of $\mathcal{F}$ and let $\mathcal{F}_\infty = \sigma(\cup_{n \ge 1} \mathcal{F}_n)$. Then

$$D_f\big(P_1^{\mathcal{F}_1}, \ldots, P_{k-1}^{\mathcal{F}_1} \,\|\, P_k^{\mathcal{F}_1}\big) \le D_f\big(P_1^{\mathcal{F}_2}, \ldots, P_{k-1}^{\mathcal{F}_2} \,\|\, P_k^{\mathcal{F}_2}\big) \le \cdots \le D_f\big(P_1^{\mathcal{F}_\infty}, \ldots, P_{k-1}^{\mathcal{F}_\infty} \,\|\, P_k^{\mathcal{F}_\infty}\big),$$

and moreover,

$$\lim_{n \to \infty} D_f\big(P_1^{\mathcal{F}_n}, \ldots, P_{k-1}^{\mathcal{F}_n} \,\|\, P_k^{\mathcal{F}_n}\big) = D_f\big(P_1^{\mathcal{F}_\infty}, \ldots, P_{k-1}^{\mathcal{F}_\infty} \,\|\, P_k^{\mathcal{F}_\infty}\big).$$

Proof. Define the measure $\nu = \frac{1}{k}\sum_{i=1}^k P_i$ and vectors $V_n$ via the Radon-Nikodym derivatives

$$V_n = \frac{1}{k}\Big( \frac{dP_1^{\mathcal{F}_n}}{d\nu^{\mathcal{F}_n}}, \frac{dP_2^{\mathcal{F}_n}}{d\nu^{\mathcal{F}_n}}, \ldots, \frac{dP_{k-1}^{\mathcal{F}_n}}{d\nu^{\mathcal{F}_n}} \Big).$$

Then $1 - \mathbf{1}^T V_n = \frac{1}{k} \, dP_k^{\mathcal{F}_n}/d\nu^{\mathcal{F}_n}$, and $V_n$ is a martingale sequence adapted to the filtration $\mathcal{F}_n$ by standard properties of conditional expectation. Letting the set $C_k = \{v \in \mathbb{R}_+^{k-1} : \mathbf{1}^T v \le 1\}$, we define $g : C_k \to \mathbb{R}$ by

$$g(v) = (1 - \mathbf{1}^T v) \, f\Big( \frac{v}{1 - \mathbf{1}^T v} \Big).$$

Then $g$ is convex (it is a perspective function), and we have

$$\mathbb{E}_\nu[g(V_n)] = \int f\Big( \frac{dP_1^{\mathcal{F}_n}}{dP_k^{\mathcal{F}_n}}, \ldots, \frac{dP_{k-1}^{\mathcal{F}_n}}{dP_k^{\mathcal{F}_n}} \Big) \frac{1}{k} \frac{dP_k^{\mathcal{F}_n}}{d\nu^{\mathcal{F}_n}} \, d\nu = \frac{1}{k} D_f\big(P_1^{\mathcal{F}_n}, \ldots, P_{k-1}^{\mathcal{F}_n} \,\|\, P_k^{\mathcal{F}_n}\big).$$

Because $V_n \in C_k$ for all $n$, we see that $g(V_n)$ is a submartingale. This gives the first result of the proposition, that is, that the sequence $D_f(P_1^{\mathcal{F}_n}, \ldots, P_{k-1}^{\mathcal{F}_n} \,\|\, P_k^{\mathcal{F}_n})$ is non-decreasing in $n$.

Now, assume that the limit in the second statement is finite, as otherwise the result is trivial. Then, using that $f(\mathbf{1}) = 0$, we have by convexity that for any $v \in C_k$,

$$g(v) = (1 - \mathbf{1}^T v) \, f\Big(\frac{v}{1 - \mathbf{1}^T v}\Big) + f(\mathbf{1}) \, \mathbf{1}^T v \ge f\big( v + (\mathbf{1}^T v)\mathbf{1} \big) \ge \inf_{v \in C_k} f\big( v + (\mathbf{1}^T v)\mathbf{1} \big) > -\infty,$$

the final inequality a consequence of the fact that $f$ is closed and hence attains its infimum. In particular, the sequence $g(V_n) - \inf_{v \in C_k} g(v)$ is a non-negative submartingale, and thus

$$\sup_n \mathbb{E}_\nu\Big[ g(V_n) - \inf_{v \in C_k} g(v) \Big] = \lim_n \mathbb{E}_\nu[g(V_n)] - \inf_{v \in C_k} g(v) < \infty.$$

Coupled with this integrability guarantee, Doob's second martingale convergence theorem thus yields the existence of a random vector $V_\infty \in \mathcal{F}_\infty$ such that

$$0 \le \lim_n \mathbb{E}_\nu\Big[ g(V_n) - \inf_{v \in C_k} g(v) \Big] = \mathbb{E}_\nu[g(V_\infty)] - \inf_{v \in C_k} g(v) < \infty.$$

Because $\inf_{v \in C_k} g(v) > -\infty$, we have $\mathbb{E}_\nu[g(V_\infty)] < \infty$, giving the result of the proposition. $\square$

Proof of Lemma 2.2 Let the base measure $\mu = \frac{1}{k}\sum_{i=1}^k P_i$ and let $p_i = dP_i/d\mu$ be the associated densities of the $P_i$. Define the increasing sequence of partitions $\mathcal{P}_n$ of $\mathcal{X}$ by the sets $A_\alpha^{(n)}$, for vectors $\alpha = (\alpha_1, \ldots, \alpha_{k-1})$, and $B^{(n)}$, with

$$A_\alpha^{(n)} = \Big\{ x \in \mathcal{X} : \frac{\alpha_j - 1}{2^n} \le \frac{p_j(x)}{p_k(x)} < \frac{\alpha_j}{2^n} \text{ for } j = 1, \ldots, k-1 \Big\},$$

where we let each $\alpha_j$ range over $\{-n2^n, -n2^n + 1, \ldots, n2^n\}$, and define $B^{(n)} = (\cup_\alpha A_\alpha^{(n)})^c = \mathcal{X} \setminus \cup_\alpha A_\alpha^{(n)}$. Then we have

$$\Big( \frac{p_1(x)}{p_k(x)}, \ldots, \frac{p_{k-1}(x)}{p_k(x)} \Big) = \lim_{n \to \infty} \Big[ \sum_{\alpha \in \{-n2^n, \ldots, n2^n\}^{k-1}} \frac{\alpha}{2^n} \, \chi_{A_\alpha^{(n)}}(x) + (n, \ldots, n) \, \chi_{B^{(n)}}(x) \Big],$$

each term of which on the right-hand side is $\mathcal{F}_n$-measurable, where $\mathcal{F}_n$ denotes the sub-$\sigma$-field generated by the partition $\mathcal{P}_n$. Defining $\mathcal{F}_\infty = \sigma(\cup_{n \ge 1} \mathcal{F}_n)$, we have

$$D_f(P_1, \ldots, P_{k-1} \,\|\, P_k \mid \mathcal{P}_n) = D_f\big(P_1^{\mathcal{F}_n}, \ldots, P_{k-1}^{\mathcal{F}_n} \,\|\, P_k^{\mathcal{F}_n}\big) \longrightarrow D_f\big(P_1^{\mathcal{F}_\infty}, \ldots, P_{k-1}^{\mathcal{F}_\infty} \,\|\, P_k^{\mathcal{F}_\infty}\big) = D_f(P_1, \ldots, P_{k-1} \,\|\, P_k),$$

where the limiting operation follows by Proposition A.1 and the final equality holds because the vector of likelihood ratios $(p_1/p_k, \ldots, p_{k-1}/p_k)$ is $\mathcal{F}_\infty$-measurable. $\square$

Proof of Lemma 2.3 Recall that the perspective function $g(y, t)$ defined by $g(y, t) = t f(y/t)$ for $t > 0$ is jointly convex in $y$ and $t$. The measure $\nu = \mu/\mu(\mathcal{X})$ defines a probability measure, so that for $X \sim \nu$, Jensen's inequality implies

$$\Big(\int a \, d\nu\Big) f\Big( \frac{\int b_1 \, d\nu}{\int a \, d\nu}, \ldots, \frac{\int b_{k-1} \, d\nu}{\int a \, d\nu} \Big) \le \mathbb{E}_\nu\Big[ a(X) f\Big( \frac{b_1(X)}{a(X)}, \ldots, \frac{b_{k-1}(X)}{a(X)} \Big) \Big] = \frac{1}{\mu(\mathcal{X})} \int a(x) f\Big( \frac{b_1(x)}{a(x)}, \ldots, \frac{b_{k-1}(x)}{a(x)} \Big) d\mu(x).$$

Noting that $\int b_i \, d\nu / \int a \, d\nu = \int b_i \, d\mu / \int a \, d\mu$ gives the result. $\square$

For proving the main theorem we need the following definition.

Definition. Let $f_1 : \mathbb{R}_+^{K-1} \to \mathbb{R}$ and $f_2 : \mathbb{R}_+^{K-1} \to \mathbb{R}$ be continuous convex functions. Let $m \in \mathbb{N}$ be arbitrary and let the nonnegative sequences $a_{ij} \ge 0$, $b_{ij} \ge 0$, $i = 1, \ldots, K-1$ and $j = 1, \ldots, m$, satisfy $\sum_{j=1}^m a_{ij} = \sum_{j=1}^m b_{ij}$ for each $i = 1, \ldots, K-1$. Then we say that $f_1$ and $f_2$ are order-equivalent if for all $m \in \mathbb{N}$ and all such sequences $a_{ij}$ and $b_{ij}$,

$$\sum_{j=1}^m f_1(a_{1j}, a_{2j}, \ldots, a_{K-1,j}) \le \sum_{j=1}^m f_1(b_{1j}, b_{2j}, \ldots, b_{K-1,j})$$

if and only if

$$\sum_{j=1}^m f_2(a_{1j}, a_{2j}, \ldots, a_{K-1,j}) \le \sum_{j=1}^m f_2(b_{1j}, b_{2j}, \ldots, b_{K-1,j}).$$

Lemma A.2. If the nonnegative sequences $(a_j)$ and $(b_j)$, $1 \le j \le M$, satisfy $\sum_{j=1}^M a_j = \sum_{j=1}^M b_j$, then there exist nonnegative numbers $z_{ij}$, $1 \le i, j \le M$, such that

$$\sum_{i=1}^M z_{ij} = a_j \quad \text{and} \quad \sum_{j=1}^M z_{ij} = b_i \qquad \text{for } 1 \le i, j \le M.$$

Proof of Lemma A.2 The proof is based on induction on $M$. When $M = 1$, the claim is immediate. Now suppose the statement is true for $M - 1$; we argue that it also holds for $M$. Without loss of generality, we may assume that $a_M = \min\{a_j : 1 \le j \le M\}$ and $b_M = \min\{b_j : 1 \le j \le M\}$ (otherwise, reorder both sequences), and that $a_M \le b_M$ (the case $b_M \le a_M$ is symmetric). We pick $z_{M,M} = a_M$ and $z_{i,M} = 0$ for all $1 \le i \le M-1$, so that column $M$ sums to $a_M$. Now let $j^*$ be such that $a_{j^*} = \max\{a_j : 1 \le j \le M-1\}$; then $a_{j^*} \ge b_M - a_M$, since $a_{j^*} \ge \frac{1}{M}\sum_{k=1}^M a_k = \frac{1}{M}\sum_{k=1}^M b_k \ge b_M$. Set $z_{M,j^*} = b_M - a_M$ and $z_{M,j} = 0$ for all $1 \le j \le M-1$ with $j \neq j^*$, so that row $M$ sums to $b_M$. Now, if we denote $\tilde{a}_j = a_j - z_{M,j}$ and $\tilde{b}_j = b_j - z_{j,M}$ for $1 \le j \le M-1$, then we have $\sum_{j=1}^{M-1} \tilde{a}_j = \sum_{j=1}^{M-1} \tilde{b}_j$. By the induction hypothesis, there exist nonnegative numbers $z_{ij}$, $1 \le i, j \le M-1$, such that $\sum_{i=1}^{M-1} z_{ij} = \tilde{a}_j$ and $\sum_{j=1}^{M-1} z_{ij} = \tilde{b}_i$ for $1 \le i, j \le M-1$. It is easy to check that the $z_{ij}$ we constructed satisfy the requirement $\sum_{i=1}^M z_{ij} = a_j$ and $\sum_{j=1}^M z_{ij} = b_i$ for $1 \le i, j \le M$. $\square$
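Lemma A.2 is a small transportation-style fact: a nonnegative matrix with prescribed row and column sums exists whenever the totals match. A standard constructive alternative to the induction above is the northwest-corner rule from transportation problems, sketched here (our implementation, for illustration only).

```python
# Northwest-corner construction of a matrix with given row/column sums.
import numpy as np

def coupling(a, b):
    """Nonnegative z with column sums a and row sums b (needs sum a == sum b)."""
    a, b = list(map(float, a)), list(map(float, b))
    M = len(a)
    z = np.zeros((M, M))
    i = j = 0
    while i < M and j < M:
        z[i, j] = min(b[i], a[j])     # ship as much as possible into cell (i, j)
        b[i] -= z[i, j]; a[j] -= z[i, j]
        if b[i] <= 1e-12: i += 1      # row i exhausted
        else: j += 1                  # column j exhausted
    return z

a = [3.0, 1.0, 2.0]; b = [2.0, 2.0, 2.0]
z = coupling(a, b)
print(z.sum(axis=0), z.sum(axis=1))   # recovers a (columns) and b (rows)
```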

Lemma A.3. If two loss functions $\Phi_1$ and $\Phi_2$ are universally equivalent, then their corresponding $f$-divergences, denoted $f_1$ and $f_2$, are order-equivalent.

Proof of Lemma A.3 Given nonnegative sequences $a_{ij} \ge 0$, $b_{ij} \ge 0$, $i = 1, \ldots, K-1$ and $j = 1, \ldots, m$, such that $\sum_{j=1}^m a_{ij} = \sum_{j=1}^m b_{ij}$ for each $i = 1, \ldots, K-1$, we may find some positive integer $M \ge m$ such that $M > \sum_{j=1}^m a_{ij}$ for all $i = 1, \ldots, K-1$. Denote $a_{i,m+1} = M - \sum_{j=1}^m a_{ij}$ and $b_{i,m+1} = M - \sum_{j=1}^m b_{ij}$ for $i = 1, \ldots, K-1$, set $a_{i,l} = b_{i,l} = 0$ for all $m+1 < l \le M$, and extend the sequences by $a_{K,l} = b_{K,l} = 1$ for all $1 \le l \le M$. Hence, we have $a_{i,l} \ge 0$, $b_{i,l} \ge 0$ for all $1 \le i \le K$, $1 \le l \le M$, and $\sum_{l=1}^M a_{i,l} = \sum_{l=1}^M b_{i,l} = M$.

Now, take $\mathcal{X} = \{(i, j) : 1 \le i \le M, 1 \le j \le M\}$ and $\mathcal{Z} = \{1, 2, \ldots, M\}$. Denote by $q_1$ the quantizer that maps $(i, j) \in \mathcal{X}$ to $i \in \mathcal{Z}$ and by $q_2$ the quantizer that maps $(i, j) \in \mathcal{X}$ to $j \in \mathcal{Z}$. Now, Lemma A.2 shows that for each $1 \le c \le K$ there exist $z_{ij}^c \ge 0$, $1 \le i, j \le M$, such that $\sum_{i=1}^M z_{ij}^c = a_{cj}/M$ and $\sum_{j=1}^M z_{ij}^c = b_{ci}/M$. Denote by $P_c$ the probability measure with $P_c(\{(i, j)\}) = z_{ij}^c$ for $1 \le i, j \le M$, $1 \le c \le K$. Under this design, $P_c(q_2^{-1}(j)) = a_{cj}/M$ and $P_c(q_1^{-1}(i)) = b_{ci}/M$; in particular, $P_K$ assigns mass $1/M$ to every cell of either quantizer. Now, under this quantizer design, we have by the universal equivalence of $\Phi_1$ and $\Phi_2$ that

$$\frac{1}{M} \sum_{j=1}^M f_1(a_{1j}, \ldots, a_{K-1,j}) \le \frac{1}{M} \sum_{j=1}^M f_1(b_{1j}, \ldots, b_{K-1,j})$$

if and only if

$$\frac{1}{M} \sum_{j=1}^M f_2(a_{1j}, \ldots, a_{K-1,j}) \le \frac{1}{M} \sum_{j=1}^M f_2(b_{1j}, \ldots, b_{K-1,j}),$$

where $f_1$ and $f_2$ are the associated $f$-divergences. Note that $a_{ij} = b_{ij}$ for all $1 \le i \le K-1$ and $j > m$, so the terms with $j > m$ are identical on both sides and cancel. Therefore, we have

$$\sum_{j=1}^m f_1(a_{1j}, \ldots, a_{K-1,j}) \le \sum_{j=1}^m f_1(b_{1j}, \ldots, b_{K-1,j}) \iff \sum_{j=1}^m f_2(a_{1j}, \ldots, a_{K-1,j}) \le \sum_{j=1}^m f_2(b_{1j}, \ldots, b_{K-1,j}).$$

Since this holds for all nonnegative sequences $a_{ij} \ge 0$, $b_{ij} \ge 0$, $i = 1, \ldots, K-1$ and $j = 1, \ldots, m$, such that $\sum_{j=1}^m a_{ij} = \sum_{j=1}^m b_{ij}$ for each $i = 1, \ldots, K-1$, the functions $f_1$ and $f_2$ are thus order-equivalent. $\square$

We give one technical lemma that will be useful throughout our derivation.

Lemma A.4 (Karamata's majorization inequality). Let $f : \mathbb{R} \to \mathbb{R}$ be a convex function. If $x_1 \ge x_2 \ge \cdots \ge x_n$, $y_1 \ge y_2 \ge \cdots \ge y_n$, $x_1 + \cdots + x_k \ge y_1 + \cdots + y_k$ for $k \in \{1, \ldots, n-1\}$, and $x_1 + \cdots + x_n = y_1 + \cdots + y_n$, then

$$f(x_1) + \cdots + f(x_n) \ge f(y_1) + \cdots + f(y_n).$$
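A quick randomized check of Lemma A.4 (our test harness; any convex $f$ and any majorizing pair will do):

```python
# Randomized sanity check of Karamata's majorization inequality.
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.exp(x)                      # any convex function works here
for _ in range(1000):
    y = np.sort(rng.normal(size=6))[::-1]
    # Build x majorizing y: same total sum, larger partial sums.
    x = y.copy(); x[0] += 0.5; x[-1] -= 0.5
    assert np.all(np.cumsum(x)[:-1] >= np.cumsum(y)[:-1])
    assert np.isclose(x.sum(), y.sum())
    assert f(x).sum() >= f(y).sum() - 1e-12  # Karamata
print("Karamata held on all random instances")
```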

Lemma A.5. Let $f_1 : \mathbb{R}_+ \to \mathbb{R}$ and $f_2 : \mathbb{R}_+ \to \mathbb{R}$ be order-equivalent. If furthermore $f_1(i) = f_2(i)$ for $i = 0, 1, 2$, then we have $f_1(x) = f_2(x)$ for all $x \in [0, 2]$.

Proof. Let $S = \{x \ge 0 : f_1(x) = f_2(x)\}$ denote the set of points where $f_1$ and $f_2$ agree. The lemma will follow once we show the following three claims:

(a) First, we show that if $0 \in S$, $t \in S$, and $2t \in S$ for some $t > 0$, then $t/2 \in S$.

(b) If $0 \in S$, $t \in S$, and $2t \in S$, then $3t/2 \in S$.

(c) Based on these two observations, an inductive argument guarantees that $[0, 2] \subset S$.

We begin with claim (a). For $q \in [0, 1]$ and $i \in \{1, 2\}$, define the functions

$$\varphi_i(q) = q f_i(2t) + 2 f_i(t/2) - (2q + 1) f_i(t) - (1 - q) f_i(0).$$

We claim that

$$\operatorname{sign} \varphi_1(q) = \operatorname{sign} \varphi_2(q) \quad \text{for all } q \in [0, 1]. \qquad (7)$$

Indeed, let $q = r/s$ for some $r, s \in \mathbb{N}$ with $s \ge r$. Then

$$s \varphi_i(q) = r f_i(2t) + 2s f_i(t/2) - (2r + s) f_i(t) - (s - r) f_i(0).$$

Hence, we have $\varphi_i(q) \ge 0$ if and only if

$$\underbrace{f_i(2t) + \cdots + f_i(2t)}_{r \text{ times}} + \underbrace{f_i(t/2) + \cdots + f_i(t/2)}_{2s \text{ times}} \ge \underbrace{f_i(t) + \cdots + f_i(t)}_{2r+s \text{ times}} + \underbrace{f_i(0) + \cdots + f_i(0)}_{s-r \text{ times}}.$$

Using that the $f_i$ are order-equivalent, we set $m = 2s + r$ and note that $r \cdot 2t + 2s \cdot (t/2) = (2r + s) \cdot t + (s - r) \cdot 0$. Using the definition of order equivalence, we reach the conclusion (7) for the rationals $\mathbb{Q}$, that is, that $\operatorname{sign}\varphi_1(q) = \operatorname{sign}\varphi_2(q)$ for $q \in \mathbb{Q} \cap [0, 1]$. To extend the result to $[0, 1]$, we use the continuity of the $\varphi_i$ and the fact that $\operatorname{cl}(\mathbb{Q} \cap [0, 1]) = [0, 1]$, which yields the desired result (7).

By the convexity of the $f_i$, Karamata's majorization inequality (Lemma A.4) implies that

$$f_i(2t) + 2 f_i(t/2) \ge 3 f_i(t) \quad \text{and} \quad 2 f_i(t/2) \le f_i(t) + f_i(0).$$

That is, we have

$$\varphi_i(1) = f_i(2t) + 2 f_i(t/2) - 3 f_i(t) \ge 0 \quad \text{and} \quad \varphi_i(0) = 2 f_i(t/2) - f_i(t) - f_i(0) \le 0.$$

Using the intermediate value theorem, there exists $r \in [0, 1]$ such that $\varphi_1(r) = 0$, and as $\operatorname{sign}\varphi_1 = \operatorname{sign}\varphi_2$, we have $\varphi_1(r) = \varphi_2(r) = 0$. Rewriting this using the definition of $\varphi_1, \varphi_2$, we have

$$0 = \varphi_i(r) = r f_i(2t) + 2 f_i(t/2) - (2r + 1) f_i(t) - (1 - r) f_i(0),$$

which we rewrite as

$$f_i(t/2) = \frac{1}{2}\big[ (2r + 1) f_i(t) + (1 - r) f_i(0) - r f_i(2t) \big].$$

As $f_1(x) = f_2(x)$ for $x \in \{0, t, 2t\}$, the above equation guarantees that $f_1(t/2) = f_2(t/2)$, completing our proof of claim (a).

A similar technique yields our claim (b), that is, that if $0, t, 2t \in S$, then $3t/2 \in S$. Consider now the functions $\varphi_i : [0, 1] \to \mathbb{R}$ defined by

$$\varphi_i(q) = f_i(3t/2) + \frac{q}{2} f_i(0) - \frac{2q + 1}{2} f_i(t) - \frac{1 - q}{2} f_i(2t).$$

As in our argument for claim (a), we have $\operatorname{sign}\varphi_1 = \operatorname{sign}\varphi_2$ on $\mathbb{Q} \cap [0, 1]$: if we assume $q = r/s$ for some $r, s \in \mathbb{N}$ with $s \ge r$, then $\varphi_i(q) \ge 0$ if and only if

$$\underbrace{f_i(3t/2) + \cdots + f_i(3t/2)}_{2s \text{ times}} + \underbrace{f_i(0) + \cdots + f_i(0)}_{r \text{ times}} \ge \underbrace{f_i(t) + \cdots + f_i(t)}_{2r+s \text{ times}} + \underbrace{f_i(2t) + \cdots + f_i(2t)}_{s-r \text{ times}}.$$

Using the order-equivalence of $f_1$ and $f_2$ by taking $m = 2s + r$ (note that $2s \cdot (3t/2) + r \cdot 0 = (2r + s) \cdot t + (s - r) \cdot 2t$), we see that $\operatorname{sign}\varphi_1(q) = \operatorname{sign}\varphi_2(q)$ for $q \in \mathbb{Q} \cap [0, 1]$. The obvious extension to $[0, 1]$ follows, so that $\operatorname{sign}\varphi_1 = \operatorname{sign}\varphi_2$ on $[0, 1]$. Now, note that by convexity of the $f_i$, Karamata's majorization inequality (Lemma A.4) implies

$$\varphi_i(0) = \frac{1}{2}\big( 2 f_i(3t/2) - f_i(t) - f_i(2t) \big) \le 0 \quad \text{and} \quad \varphi_i(1) = \frac{1}{2}\big( 2 f_i(3t/2) + f_i(0) - 3 f_i(t) \big) \ge 0.$$

The intermediate value theorem implies there exists $r \in [0, 1]$ such that $\varphi_i(r) = 0$ for $i = 1, 2$ (as $\operatorname{sign}\varphi_1 = \operatorname{sign}\varphi_2$), whence

$$\varphi_i(r) = f_i(3t/2) + \frac{r}{2} f_i(0) - \frac{2r + 1}{2} f_i(t) - \frac{1 - r}{2} f_i(2t) = 0.$$

This immediately implies

$$f_i(3t/2) = \frac{2r + 1}{2} f_i(t) + \frac{1 - r}{2} f_i(2t) - \frac{r}{2} f_i(0).$$

As $f_1(x) = f_2(x)$ for $x = 0, t, 2t$, the above formula implies $f_1(x) = f_2(x)$ also at $x = 3t/2$, yielding our claim (b).

The final part of our argument is to note that claims (a) and (b) can be generalized to any three numbers $\{s, s + t, s + 2t\} \subset [0, 2]$: that is, if $\{s, s + t, s + 2t\} \subset S$, then $s + \frac{1}{2} t$ and $s + \frac{3}{2} t$ belong to $S$. This follows when we define the shifted functions $\tilde{f}_i(x) = f_i(s + x)$, noting that $\tilde{f}_1$ and $\tilde{f}_2$ are order-equivalent (by the definition of order equivalence). Applying claims (a) and (b) at $(0, t, 2t)$ to the $\tilde{f}_i$, we see that $\tilde{f}_1(t/2) = \tilde{f}_2(t/2)$ and $\tilde{f}_1(3t/2) = \tilde{f}_2(3t/2)$. An inductive argument immediately implies that all the dyadic numbers, those rationals of the form $q = m/2^n$ with $m = 0, \ldots, 2^{n+1}$, belong to the set $S$. As $f_1$ and $f_2$ are continuous, the set $S = \{x \ge 0 : f_1(x) = f_2(x)\}$ is closed, and by the denseness of the dyadic numbers in $[0, 2]$, it must be the case that $S \supset [0, 2]$. $\square$

One may use the argument shown above to obtain the following corollary.

Corollary A.6. Let $f_1 : \mathbb{R}_+ \to \mathbb{R}$ and $f_2 : \mathbb{R}_+ \to \mathbb{R}$ be order-equivalent. If furthermore $f_1(a + ib) = f_2(a + ib)$ for $i = 0, 1, 2$, then we have $f_1(x) = f_2(x)$ for all $x \in [a, a + 2b]$.

Lemma A.7. Let $f_1, f_2 : \mathbb{R}_+ \to \mathbb{R}$ be order-equivalent. If $f_1(i) = f_2(i)$ for $i = 0, 1, 2$, and furthermore $2 f_1(1) \neq f_1(2) + f_1(0)$, then $f_1(t) = f_2(t)$ for all $t \in \mathbb{R}_+$.

Proof. Convexity of $f_1$ together with the given condition implies $2 f_1(1) - f_1(2) - f_1(0) < 0$. Since $f_1$ and $f_2$ are order-equivalent, we obtain $2 f_2(1) - f_2(2) - f_2(0) < 0$. Let $n \ge 3$ be an arbitrary integer and consider the functions $\varphi_i : [0, 1] \to \mathbb{R}$ for $i \in \{1, 2\}$, defined by

$$\varphi_i(q) = q f_i(n) + (nq - 1) f_i(0) - \big( (n + 2) q - 2 \big) f_i(1) - (1 - q) f_i(2).$$

We claim that (7) still holds. Indeed, first for rational numbers $q = r/s$, where $r, s \in \mathbb{N}$ with $s \ge r$, we have

$$s \varphi_i(q) = r f_i(n) + (nr - s) f_i(0) - \big( (n + 2) r - 2s \big) f_i(1) - (s - r) f_i(2).$$

Hence, we have $\varphi_i(q) \ge 0$ if and only if

$$\underbrace{f_i(n) + \cdots + f_i(n)}_{r \text{ times}} + \underbrace{f_i(0) + \cdots + f_i(0)}_{nr - s \text{ times}} \ge \underbrace{f_i(1) + \cdots + f_i(1)}_{(n+2)r - 2s \text{ times}} + \underbrace{f_i(2) + \cdots + f_i(2)}_{s - r \text{ times}}.$$

If $nr - s < 0$ we can move the term containing $f_i(0)$ to the right-hand side; likewise, if $(n+2)r - 2s < 0$ we can move the term containing $f_i(1)$ to the left-hand side. We set $m = r + (nr - s) = \big( (n+2)r - 2s \big) + (s - r)$ and note that $n \cdot r + 0 \cdot (nr - s) = 1 \cdot \big( (n+2)r - 2s \big) + 2 \cdot (s - r)$. Since $f_1$ and $f_2$ are order-equivalent, $\varphi_1$ and $\varphi_2$ satisfy (7) for $q \in \mathbb{Q} \cap [0, 1]$. Using the continuity of $\varphi_1$ and $\varphi_2$ we can extend this result to all numbers in $[0, 1]$, which proves the claim.

By convexity of the $f_i$, Karamata's majorization inequality (Lemma A.4) implies that

$$\varphi_i(1) = f_i(n) + (n - 1) f_i(0) - n f_i(1) \ge 0,$$

while by assumption

$$\varphi_i(0) = 2 f_i(1) - f_i(2) - f_i(0) < 0.$$

Using the intermediate value theorem, there exists $u \in (0, 1]$ such that $\varphi_1(u) = 0$, and since $\operatorname{sign}\varphi_1 = \operatorname{sign}\varphi_2$ we obtain $\varphi_2(u) = 0$. We can rewrite this as

$$\varphi_i(u) = u f_i(n) + (nu - 1) f_i(0) - \big( (n+2)u - 2 \big) f_i(1) - (1 - u) f_i(2) = 0,$$

which can also be written as

$$f_i(n) = \frac{1}{u} \Big[ \big( (n+2)u - 2 \big) f_i(1) + (1 - u) f_i(2) - (nu - 1) f_i(0) \Big].$$

This shows that $f_1(n) = f_2(n)$. Since $n \ge 3$ was chosen arbitrarily, we obtain $f_1(n) = f_2(n)$ for all integers $n \ge 0$. For completing the proof we only need to use Corollary A.6: for any $t \in \mathbb{R}_+$, pick integers $a$ and $b$ with $a < t < a + 2b$. Since $f_1(a + ib) = f_2(a + ib)$ for $i \in \{0, 1, 2\}$, we get $f_1(t) = f_2(t)$ according to Corollary A.6. $\square$

Now we have the tools required for proving the main result. Here we only show the difficult direction in the case $K = 2$, where $f_1$ and $f_2$ are functions of a single variable.

Proof of Theorem 4.1 One direction is easy to prove. If $f_1(t) = a f_2(t) + b^T t + c$, then by computing the $f_1$- and $f_2$-divergences we get

$$D_{f_1}(P_1, P_2, \ldots, P_{k-1} \,\|\, P_k \mid q) = a \, D_{f_2}(P_1, P_2, \ldots, P_{k-1} \,\|\, P_k \mid q) + b^T \mathbf{1} + c,$$

since the affine part contributes $\sum_i \big( b^T \frac{(P_1(A_i), \ldots, P_{k-1}(A_i))}{P_k(A_i)} + c \big) P_k(A_i) = \sum_j b_j + c$, a constant independent of $q$. Since $a > 0$, this means that $\Phi_1$ and $\Phi_2$ induce the same ordering on quantizers, as required.

For proving the other direction, assuming $K = 2$, Lemma A.3 shows that $f_1$ and $f_2$ are order-equivalent. Denote the sets $T_i = \{t \in \mathbb{R}_+ : f_i(2t) + f_i(0) \neq 2 f_i(t)\}$ for $i \in \{1, 2\}$. We have two different cases:

(a) $T_1 \neq \emptyset$. Pick any $t_0 \in T_1$. Note that $t_0 \in T_2$, according to the order-equivalence of $f_1$ and $f_2$. Define $\tilde{f}_i(t) = f_i(t \, t_0)$ for $i \in \{1, 2\}$. It is easy to check that $\tilde{f}_1$ and $\tilde{f}_2$ are order-equivalent. Choose numbers $a$, $b$, and $c$ as

$$a = \frac{\tilde{f}_1(2) - 2\tilde{f}_1(1) + \tilde{f}_1(0)}{\tilde{f}_2(2) - 2\tilde{f}_2(1) + \tilde{f}_2(0)}, \qquad b = \tilde{f}_1(0) - a \tilde{f}_2(0), \qquad c = \tilde{f}_1(1) - a \tilde{f}_2(1) - b.$$

(Note that $a > 0$, as both second differences are strictly positive by convexity and the fact that $t_0 \in T_1 \cap T_2$.)

Construct a new function $\hat{f}(t) = a \tilde{f}_2(t) + ct + b$. According to the choice of $a$, $b$, and $c$, we have $\tilde{f}_1(t) = \hat{f}(t)$ for $t \in \{0, 1, 2\}$. Furthermore, $2 \tilde{f}_1(1) \neq \tilde{f}_1(2) + \tilde{f}_1(0)$, since $t_0 \in T_1$. Lemma A.7 thus implies that $\tilde{f}_1(t) = \hat{f}(t)$ for all $t \in \mathbb{R}_+$; that is, $f_1(t \, t_0) = a f_2(t \, t_0) + ct + b$ for all $t \in \mathbb{R}_+$. Since $t_0 > 0$ is fixed, we obtain $f_1(t) = a f_2(t) + (c/t_0) t + b$, which implies the result in this case.

(b) $T_1 = \emptyset$. Then for all $t \in \mathbb{R}_+$, $2 f_1(t) = f_1(2t) + f_1(0)$. This also implies that $2 f_2(t) = f_2(2t) + f_2(0)$ for all $t \in \mathbb{R}_+$, by the order-equivalence. Using the convexity of the $f_i$, we conclude that each $f_i$ is an affine function on $\mathbb{R}_+$. Therefore, there exist $b, c \in \mathbb{R}$ such that $f_1(t) = f_2(t) + bt + c$, as desired. $\square$

References

[LLW04] Yoonkyung Lee, Yi Lin, and Grace Wahba. Multicategory support vector machines: Theory and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99(465):67-81, 2004.

[LV06] F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Transactions on Information Theory, 52(10):4394-4412, 2006.

[NWJ09] XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. On surrogate loss functions and f-divergences. The Annals of Statistics, 37(2):876-904, 2009.

[TB07] Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007-1025, 2007.

[Tsi93] John N. Tsitsiklis. Decentralized detection. In Advances in Statistical Signal Processing, Vol. 2, pages 297-344. JAI Press, 1993.

[Zha04] Tong Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225-1251, 2004.
