Ambiguity Sets and Their Applications to SVM

Ammon Washburn
University of Arizona
April 22, 2016
Introduction
- Go over a little measure and set theory
- Explain what φ-divergences are
- Apply them to support vector machines
Measures and Probability Measures
- A measure space consists of three things: $(X, \mathcal{X}, \mu)$, a set, a σ-algebra of its subsets, and a measure.
- After several weeks deep into measure theory you realize power sets are bad (too large to carry a useful measure).
- So for $X = \mathbb{R}$ we take $\mathcal{X}$ to be the Borel sets and $\mu$ to be Lebesgue measure.
- A probability measure $P$ is a measure with total mass one: $P(X) = 1$.
Overview of Important Concepts in Probability Theory
- For Lebesgue measure we can use integrals: $\mu(A) = \int_A dx$.
- If a probability measure $P$ is absolutely continuous with respect to (dominated by) Lebesgue measure, then there exists a function $p(x)$ so that $P(A) = \int_A p(x)\,dx$.
- $p(x)$ is called the density of $P$, and by the properties of $P$ we know that $\int_{\mathbb{R}} p(x)\,dx = 1$.
- Abstractly, if $P$ is dominated by $Q$ then there exists a function $\frac{dP}{dQ}(x)$ (the Radon–Nikodym derivative) so that $P(A) = \int_A \frac{dP}{dQ}(x)\,dQ$.
- In other words, we only have to worry about the measure $Q$ and then find the special function that makes it work.
Examples
Suppose we have a probability measure dominated by Lebesgue measure with density $p(x) = 2x\,\mathbf{1}_{[0,1]}(x)$.
- What is the probability of the set $A = \{\frac{1}{2}\}$?
- What is the probability of the set $A = [1, \infty)$?
- What is the probability of the set $A = [0, \frac{1}{2}]$?
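The three questions above can be answered with the CDF $F(x) = x^2$ on $[0,1]$, the antiderivative of the density. A minimal sketch (the helper names `cdf` and `prob_interval` are our own, not from the slides):

```python
def cdf(x):
    # CDF of the density p(x) = 2x on [0,1]: F(x) = x^2, clipped outside [0,1]
    if x <= 0:
        return 0.0
    if x >= 1:
        return 1.0
    return x * x

def prob_interval(a, b):
    # P([a, b]) for an absolutely continuous measure is F(b) - F(a)
    return cdf(b) - cdf(a)

print(prob_interval(0.5, 0.5))           # singleton {1/2} is Lebesgue-null: 0.0
print(prob_interval(1, float('inf')))    # density vanishes on [1, inf): 0.0
print(prob_interval(0, 0.5))             # integral of 2x from 0 to 1/2: 0.25
```

Note that singletons and sets outside the support get probability zero, which is exactly the point of the first two questions.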
φ-divergences
We can measure how different two points are by taking their distance $\|x - y\|_2$. How can we measure the distance between two distributions (two probability measures)?
$$D(P, Q) = \int_X \phi\!\left(\frac{dP}{dQ}\right) dQ = \int_X \phi\!\left(\frac{p(z)}{q(z)}\right) q(z)\,dz \qquad (1)$$
where $\phi$ is a convex function with $\phi(1) = 0$, $0\,\phi(a/0) \equiv a \lim_{t \to \infty} \phi(t)/t$, and $0\,\phi(0/0) \equiv 0$. Note that $P$ must be dominated by $Q$ (denoted $P \ll Q$) or the divergence is infinite.
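For discrete distributions the integral in (1) becomes a sum, and $\phi(t) = t \log t$ recovers the Kullback-Leibler divergence. A small sketch (the function names are ours):

```python
import math

def phi_kl(t):
    # Kullback-Leibler generator: phi(t) = t log t, convex with phi(1) = 0,
    # using the convention 0 log 0 := 0
    return t * math.log(t) if t > 0 else 0.0

def phi_divergence(p, q, phi):
    # Discrete version of (1): D(P, Q) = sum_i phi(p_i / q_i) * q_i
    d = 0.0
    for pi, qi in zip(p, q):
        if qi == 0:
            if pi > 0:
                return float('inf')  # P is not dominated by Q
            continue
        d += phi(pi / qi) * qi
    return d

p = [0.2, 0.5, 0.3]
q = [1 / 3, 1 / 3, 1 / 3]
print(phi_divergence(p, q, phi_kl))  # KL(P||Q) > 0
print(phi_divergence(q, q, phi_kl))  # 0 when P = Q
```

The dominance check mirrors the $P \ll Q$ requirement: putting mass where $Q$ has none makes the divergence infinite.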
Solving a Robust Linear Optimization Problem in the Discrete Case
Consider the following problem, from Ben-Tal et al. (2013):
$$\min_w \ c^\top w \qquad (2a)$$
$$\text{s.t.} \ (a + Bp)^\top w \le \beta \quad \forall p \in U \qquad (2b)$$
$U \subseteq \mathbb{R}^m$ is our uncertainty region, and we make the problem robust by requiring that the constraint be fulfilled for all $p \in U$.
Theorem for RLOs and φ-divergences
Theorem. Let $U = \{p \in \mathbb{R}^m \mid p \ge 0,\ Cp \le d,\ D(p, q) \le \rho\}$. Then the constraint in equation (2) can be replaced by the following constraints:
$$a^\top w + d^\top \eta + \rho\lambda + \lambda \sum_{i=1}^m q_i\, \phi^*\!\left(\frac{b_i^\top w - c_i^\top \eta}{\lambda}\right) \le \beta \qquad (3a)$$
$$\eta \ge 0, \quad \lambda \ge 0 \qquad (3b)$$
where $\phi^*$ is the convex conjugate of $\phi$, and $b_i$, $c_i$ are the $i$-th columns of $B$ and $C$.
Table: some φ-divergence examples with their conjugates (Table 4 of Ben-Tal et al. (2013)). The last column indicates the tractability of the resulting robust counterpart; S.C. means the constraint admits a self-concordant barrier.

Divergence               Conjugate φ*(s)                              Robust counterpart
Kullback-Leibler         e^s − 1                                      S.C.
Burg entropy             −log(1 − s), s < 1                           S.C.
J-divergence             no closed form                               S.C.
χ²-distance              2 − 2√(1 − s), s < 1                         CQP
Modified χ²-distance     −1 if s < −2;  s + s²/4 if s ≥ −2            CQP
Hellinger distance       s/(1 − s), s < 1                             CQP
χ-divergence of order θ  (see paper)                                  CQP
Variation distance       max{−1, s}, s ≤ 1                            LP
Cressie-Read             (see paper)                                  CQP
Proof of Theorem (1)
Our constraint can be turned into the following maximization problem:
$$\beta \ge \max_{p} \left\{ (a + Bp)^\top w \;\middle|\; p \ge 0,\ Cp \le d,\ \sum_{i=1}^m q_i\, \phi\!\left(\frac{p_i}{q_i}\right) \le \rho \right\} \qquad (4)$$
Then we can form the Lagrangian function $L$ and the dual objective function $g$:
$$g(\lambda, \eta) = \max_{p \ge 0} \left\{ (a + Bp)^\top w + \lambda\Big(\rho - \sum_{i=1}^m q_i\, \phi\!\left(\frac{p_i}{q_i}\right)\Big) + \eta^\top (d - Cp) \right\} \qquad (5)$$
$$= \max_{p \ge 0}\ L(p, \lambda, \eta) \qquad (6)$$
By weak duality we just need to check that $\min_{\lambda, \eta \ge 0} g(\lambda, \eta) \le \beta$.
Proof of Theorem (2)
Now we can show the following:
$$\begin{aligned} g(\lambda, \eta) &= a^\top w + d^\top \eta + \rho\lambda + \max_{p \ge 0} \sum_{i=1}^m \left( p_i (b_i^\top w) - p_i (c_i^\top \eta) - \lambda q_i\, \phi(p_i / q_i) \right) \\ &= a^\top w + d^\top \eta + \rho\lambda + \sum_{i=1}^m \max_{p_i \ge 0} \left( p_i (b_i^\top w - c_i^\top \eta) - \lambda q_i\, \phi(p_i / q_i) \right) \\ &= a^\top w + d^\top \eta + \rho\lambda + \sum_{i=1}^m \lambda q_i \max_{t \ge 0} \left( t\, (b_i^\top w - c_i^\top \eta)/\lambda - \phi(t) \right) \\ &= a^\top w + d^\top \eta + \rho\lambda + \sum_{i=1}^m \lambda q_i\, \phi^*\!\left(\frac{b_i^\top w - c_i^\top \eta}{\lambda}\right) \end{aligned}$$
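The last step uses the convex conjugate $\phi^*(s) = \sup_{t \ge 0}(st - \phi(t))$. For the KL generator $\phi(t) = t \log t$ this conjugate has the closed form $e^{s-1}$, which we can sanity-check numerically (a sketch; the grid maximization is only approximate):

```python
import math

def phi(t):
    # Kullback-Leibler generator phi(t) = t log t, with 0 log 0 := 0
    return t * math.log(t) if t > 0 else 0.0

def conjugate_numeric(s, t_max=50.0, n=100000):
    # Approximate phi*(s) = sup_{t >= 0} (s t - phi(t)) by grid search
    return max(s * (k * t_max / n) - phi(k * t_max / n) for k in range(n + 1))

for s in [-1.0, 0.0, 0.5, 1.5]:
    exact = math.exp(s - 1.0)  # closed-form conjugate of t log t
    print(s, exact, conjugate_numeric(s))  # the two values agree to ~3 decimals
```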
Corollary
Theorem. If we are only concerned with probability vectors ($e^\top p = 1$, $p \ge 0$) then we can reduce the constraint in equation (2) to the following:
$$a^\top w + \eta + \rho\lambda + \lambda \sum_{i=1}^m q_i\, \phi^*\!\left(\frac{b_i^\top w - \eta}{\lambda}\right) \le \beta \qquad (7a)$$
$$\lambda \ge 0 \qquad (7b)$$
Here $\eta$ is the scalar multiplier for the equality $e^\top p = 1$, so it is free in sign.
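As a sanity check on this duality (a sketch of our own, not from the paper): for the KL divergence, minimizing (7a) over $\eta$ in closed form gives $\sup\{c^\top p : e^\top p = 1,\ p \ge 0,\ D(p,q) \le \rho\} = \min_{\lambda > 0} \lambda \log \sum_i q_i e^{c_i/\lambda} + \rho\lambda$, which we can compare against brute-force maximization in two dimensions:

```python
import math

def kl(p, q):
    # KL divergence between discrete distributions (0 log 0 := 0)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def worst_case_direct(c, q, rho, n=20000):
    # Brute force: max c.p over the two-point simplex subject to KL(p||q) <= rho
    best = -float('inf')
    for k in range(n + 1):
        p = [k / n, 1 - k / n]
        if kl(p, q) <= rho:
            best = max(best, c[0] * p[0] + c[1] * p[1])
    return best

def worst_case_dual(c, q, rho, n=20000, lam_max=20.0):
    # min_{lambda > 0} lambda * log E_q[e^{c/lambda}] + rho * lambda
    # (log-sum-exp trick keeps e^{c_i/lambda} from overflowing for tiny lambda)
    m = max(c)
    best = float('inf')
    for k in range(1, n + 1):
        lam = k * lam_max / n
        lse = m + lam * math.log(sum(qi * math.exp((ci - m) / lam)
                                     for ci, qi in zip(c, q)))
        best = min(best, lse + rho * lam)
    return best

c, q, rho = [1.0, 0.0], [0.5, 0.5], 0.1
print(worst_case_direct(c, q, rho), worst_case_dual(c, q, rho))  # the two agree
```

Weak duality guarantees the direct value never exceeds the dual value; here the two coincide up to grid resolution.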
Application (1)
Consider the robust newsvendor model (also from Ben-Tal et al. (2013)):
$$\max_{Q} \min_{p \in U} \sum_{i=1}^m p_i\, u(r(Q, d_i)) \qquad (8)$$
$$\text{s.t.} \quad r(Q, d_i) = v \min(d_i, Q) + s (Q - d_i)^+ - l (d_i - Q)^+ - cQ \qquad (9)$$
We only have the historical sample frequencies $q$, so $U = \{p \in \mathbb{R}^m \mid e^\top p = 1,\ p \ge 0,\ D(p, q) \le \rho\}$.
Application (2)
After applying the theorem and adjusting some things we get
$$\max_{Q, \eta, \lambda \ge 0} \left\{ \eta - \rho\lambda - \lambda \sum_{i=1}^m q_i\, \phi^*\!\left(\frac{\eta - u(r(Q, d_i))}{\lambda}\right) \right\} \qquad (10)$$
If $u(x)$ is concave and nondecreasing and $v + l \ge s$ (the condition is stated incorrectly in the paper), then the problem is still convex (a concave maximization).
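A toy sketch of the robust newsvendor (all parameter values here are hypothetical, and we use linear utility $u(x) = x$ with a brute-force inner minimization over a two-scenario KL ball rather than the dual form (10)):

```python
import math

def revenue(Q, d, v=5.0, s=1.0, l=2.0, c=3.0):
    # r(Q, d) = v*min(d, Q) + s*(Q - d)^+ - l*(d - Q)^+ - c*Q
    return v * min(d, Q) + s * max(Q - d, 0.0) - l * max(d - Q, 0.0) - c * Q

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def robust_objective(Q, demands, q, rho, n=2000):
    # Worst-case expected revenue over the KL ball (two-scenario grid search)
    worst = float('inf')
    for k in range(n + 1):
        p = [k / n, 1 - k / n]
        if kl(p, q) <= rho:
            worst = min(worst, sum(pi * revenue(Q, d) for pi, d in zip(p, demands)))
    return worst

demands, q, rho = [4.0, 10.0], [0.5, 0.5], 0.1
best_Q = max(range(0, 21), key=lambda Q: robust_objective(float(Q), demands, q, rho))
print(best_Q, robust_objective(float(best_Q), demands, q, rho))
```

The robust order quantity hedges against the adversary shifting probability toward the less profitable demand scenario, so the worst-case value is never above the nominal expectation.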
Ambiguity and SVM
- For most applications of SVM, the uncertainty in the data is continuous (w.r.t. Lebesgue measure) and not discrete.
- The φ-divergence between a continuous and a discrete distribution is always infinite.
- Maybe we can estimate probabilities for the nominal distribution and then turn the discrete nominal distribution into a continuous one.
- Let's try the same strategy as Ben-Tal et al. (2013), but with continuous distributions.
SVM with the Kantorovich Metric
The Kantorovich metric $\rho$ is defined as follows:
$$\rho(P, Q) \equiv \inf_K \left\{ \int_{X \times X} d(x, y)\, K(dx, dy) \;\middle|\; \text{marginals of } K \text{ are } P \text{ and } Q \right\} \qquad (11)$$
Now the distance between a continuous and a discrete distribution is no longer infinite.
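For two empirical distributions on the real line with the same number of equally weighted atoms, the optimal coupling $K$ simply matches points in sorted order, so the metric reduces to a mean of sorted differences. A sketch for this special case only (not the general formulation in (11)):

```python
def kantorovich_1d(xs, ys):
    # 1-D Kantorovich (1-Wasserstein) distance between two equal-size,
    # equally weighted samples: the optimal transport matches sorted order
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

print(kantorovich_1d([0.0, 1.0], [0.0, 1.0]))  # identical samples: 0.0
print(kantorovich_1d([0.0, 1.0], [0.5, 1.5]))  # shift both atoms by 0.5: 0.5
```

Unlike a φ-divergence, this value stays finite even when the two samples share no common support.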
Semi-infinite DR-SVM
The DR-SVM, with uncertainty set $U$ defined via the Kantorovich metric, is defined as follows:
$$\min_{w,b} \frac{1}{2} w^\top w + C \sup_{P \in U} \int \left[1 - y(x^\top w - b)\right]_+ P(dx, dy) \qquad (12)$$
This is equivalent to (Lee and Mehrotra):
$$\min_{w,b,\xi,u} \ \frac{1}{2} w^\top w + C\left( \frac{1}{m} \sum_{j=1}^m \xi_j + \eta u \right) \qquad (13a)$$
$$\text{s.t.} \ \xi_j \ge 1 - y(x^\top w - b) - \left[d(x, x_j) + d_y(y, y_j)\right] u \quad \forall (x, y),\ \forall j \qquad (13b)$$
$$\xi \ge 0, \quad u \ge 0 \qquad (13c)$$
Sketch of Proof
We consider just the second part of (12) and use a fact from Shapiro (2001) that the following are the primal and dual of our problem:
$$\max_{P \in U} \ \mathbb{E}_P[f(x, y)] \quad \text{s.t.} \quad \mathbb{E}_P[g_i(x, y)] = b_i, \quad i = 1, \ldots, t$$
$$\min_{\xi} \ b^\top \xi \quad \text{s.t.} \quad \sum_{i=1}^t \xi_i g_i(x, y) - f(x, y) \in \mathcal{C}$$
where $\mathcal{C} = \left\{ f : \int_X f(x, y)\, P(dx, dy) \ge 0,\ \forall P \in U \right\}$.
KL Divergence
Let $\phi(x) = x \log(x)$. This gives us the Kullback-Leibler divergence:
$$D(P, P_0) = \int_X p(z) \log\!\left(\frac{p(z)}{p_0(z)}\right) dz \qquad (14)$$
We would like to use this to bound, in a robust way, the true probability $P$ with the nominal probability $P_0$, which is given by the data. Define $U = \{P : D(P, P_0) \le \eta\}$.
Result
Take our original problem and make it robust:
$$\min \ h(w) \quad \text{s.t.} \quad \min_{P \in U} \Pr_P\{H(w, x) \le 0\} \ge 1 - \epsilon$$
then transform it into a tractable problem (Hu and Hong (2013)) where there is no ambiguity:
$$\min \ h(w) \quad \text{s.t.} \quad \Pr_{P_0}\{H(w, x) \le 0\} \ge 1 - \bar{\epsilon}, \qquad \text{where} \quad \bar{\epsilon} = \sup_{t > 0} \frac{e^{-\eta}(t + 1)^{\epsilon} - 1}{t}$$
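A quick numerical sketch of the adjusted risk level (grid search over $t$; the grid bounds are our own choice). As $\eta \to 0$ the adjustment disappears and $\bar{\epsilon} \to \epsilon$, while ambiguity ($\eta > 0$) forces a stricter nominal level:

```python
import math

def adjusted_epsilon(eps, eta, n=100000, t_max=100.0):
    # eps_bar = sup_{t > 0} (e^{-eta} (t+1)^eps - 1) / t, by grid search
    best = -float('inf')
    for k in range(1, n + 1):
        t = k * t_max / n
        best = max(best, (math.exp(-eta) * (t + 1.0) ** eps - 1.0) / t)
    return best

print(adjusted_epsilon(0.05, 0.0))  # ~0.05: no ambiguity, no adjustment
print(adjusted_epsilon(0.05, 0.1))  # well below 0.05: a stricter nominal level
```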
Proof (1)
Consider the following problem:
$$\min_{w \in W} \max_{P \in U} \mathbb{E}_P[H(w, x)]$$
By multiplying by $\frac{p_0(x)}{p_0(x)}$ and letting $L(x) = \frac{p(x)}{p_0(x)}$, we can change the inner maximization problem to look like this:
$$\max_{L \in \mathcal{L}} \ \mathbb{E}_{P_0}[H(w, x) L(x)] \quad \text{s.t.} \quad \mathbb{E}_{P_0}[L \log(L)] \le \eta$$
where $\mathcal{L} = \{L \mid \mathbb{E}_{P_0}[L] = 1,\ L \ge 0 \text{ a.s.}\}$.
Proof (2)
$$l(\alpha, L) = \mathbb{E}_{P_0}[H(w, x) L(x)] - \alpha\left(\mathbb{E}_{P_0}[L(x) \log(L(x))] - \eta\right)$$
If we maximize this over $L \in \mathcal{L}$ and then take the minimum over $\alpha \ge 0$, then we solve the dual, and by the convexity of the problem the two are equal. The inner problem is
$$\max_L \ \mathbb{E}_{P_0}[H(w, x) L(x) - \alpha L(x) \log(L(x))] \quad \text{s.t.} \quad \mathbb{E}_{P_0}[L(x)] = 1, \quad L(x) \ge 0$$
Proof (3)
Now we form the functionals $J(f) = \mathbb{E}_{P_0}[H(w, x) f(x) - \alpha f(x) \log(f(x))]$ and $J_c(f) = \mathbb{E}_{P_0}[f(x)] - 1$. We then set up an unconstrained optimization for these functionals and solve. After some functional analysis you get
$$L^*(x) = \frac{e^{H(w,x)/\alpha}}{\mathbb{E}_{P_0}[e^{H(w,x)/\alpha}]} \qquad (15)$$
Plug that back into $l(L, \alpha)$ and you get
$$l(L^*, \alpha) = v(\alpha) = \alpha \log\left(\mathbb{E}_{P_0}\left[e^{H(w,x)/\alpha}\right]\right) + \alpha\eta$$
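A discrete sanity check with made-up numbers: for $P_0$ uniform on a few points, the exponentially tilted ratio $L^*$ of (15) should beat any other feasible likelihood ratio in the entropy-regularized objective $\mathbb{E}_{P_0}[H L - \alpha L \log L]$ (a sketch; `objective` and the random competitors are our own scaffolding):

```python
import math, random

random.seed(0)
n, alpha = 5, 0.7
H = [random.uniform(-1.0, 1.0) for _ in range(n)]
p0 = [1.0 / n] * n  # uniform nominal distribution

def objective(L):
    # E_{P0}[H L - alpha * L log L], with 0 log 0 := 0
    return sum(p * (h * l - alpha * l * math.log(l) if l > 0 else 0.0)
               for p, h, l in zip(p0, H, L))

# Optimal likelihood ratio from (15): L*(x) = e^{H/alpha} / E_{P0}[e^{H/alpha}]
Z = sum(p * math.exp(h / alpha) for p, h in zip(p0, H))
L_star = [math.exp(h / alpha) / Z for h in H]

assert abs(sum(p * l for p, l in zip(p0, L_star)) - 1.0) < 1e-12  # E[L*] = 1
for _ in range(1000):
    raw = [random.uniform(0.0, 2.0) for _ in range(n)]
    mean = sum(p * r for p, r in zip(p0, raw))
    L = [r / mean for r in raw]  # random feasible competitor with E[L] = 1
    assert objective(L) <= objective(L_star) + 1e-12
print("L* maximizes the entropy-regularized objective")
```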
Proof (4)
Now just notice that $\Pr_{P_0}(H(w, x) \le 0) = \mathbb{E}_{P_0}\left[\mathbf{1}_{H(w,x) \le 0}(x)\right]$, plug it into the above solution, and you get the constraint from before:
$$\Pr_{P_0}\{H(w, x) \le 0\} \ge 1 - \bar{\epsilon}, \qquad \bar{\epsilon} = \sup_{t > 0} \frac{e^{-\eta}(t + 1)^{\epsilon} - 1}{t}$$
References

Aharon Ben-Tal, Dick den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341–357, 2013.

Zhaolin Hu and L. Jeff Hong. Kullback-Leibler divergence constrained distributionally robust optimization. Available at Optimization Online, 2013.

Changhyeok Lee and Sanjay Mehrotra. A distributionally-robust approach for finding support vector machines. Not yet published.

Alexander Shapiro. On duality theory of conic linear problems. In Semi-Infinite Programming, pages 135–165, 2001.