Ambiguity Sets and their applications to SVM


1 Ambiguity Sets and their applications to SVM
Ammon Washburn
University of Arizona
April 22, 2016
Ammon Washburn (University of Arizona), Ambiguity Sets, April 22, 2016

2 Introduction
- Go over some (very little) set theory
- Explain what φ-divergences are
- Apply them to Support Vector Machines

3 Measures and Probability Measures
- A measure space consists of three things: $(X, \mathcal{X}, \mu)$.
- After several weeks deep into measure theory you realize power sets are bad.
- So for $X = \mathbb{R}$, we just take $\mathcal{X}$ to be the Borel sets and $\mu$ to be Lebesgue measure.
- A probability measure $P$ is a measure with total mass one: $P(X) = 1$.

4 Overview of Important Concepts in Probability Theory
- For Lebesgue measure we can use integrals: $\mu(A) = \int_A dx$.
- If a probability measure $P$ is absolutely continuous with respect to (dominated by) Lebesgue measure, then there exists a function $p(x)$ such that $P(A) = \int_A p(x)\,dx$.
- $p(x)$ is called the density of $P$, and by the properties of $P$ we know that $\int_{\mathbb{R}} p(x)\,dx = 1$.
- Abstractly, if $P$ is dominated by $Q$, then there exists a function $\frac{dP}{dQ}(x)$ such that $P(A) = \int_A \frac{dP}{dQ}(x)\,dQ$.
- In other words, we only have to worry about the measure $Q$ and then find the special function that makes it work.

5 Examples
Consider a probability measure that is dominated by Lebesgue measure and has density $2x\,\mathbf{1}_{[0,1]}(x)$.
- What is the probability of the set $A = \{\tfrac{1}{2}\}$?
- What is the probability of the set $A = [1, \infty)$?
- What is the probability of the set $A = [0, \tfrac{1}{2}]$?
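The three questions can be checked numerically. The sketch below is only an illustration: the helper names `density` and `prob` are made up here, and the infinite tail is truncated at an arbitrary point because the density vanishes beyond 1. Integrating $2x$ by hand gives 0, 0 and $1/4$ respectively.

```python
def density(x: float) -> float:
    """Density p(x) = 2x on [0, 1], zero elsewhere."""
    return 2.0 * x if 0.0 <= x <= 1.0 else 0.0


def prob(lo: float, hi: float, n: int = 10_000) -> float:
    """Approximate P([lo, hi]) with the midpoint rule (hypothetical helper)."""
    if hi <= lo:
        return 0.0  # a single point has Lebesgue measure zero
    h = (hi - lo) / n
    return sum(density(lo + (k + 0.5) * h) for k in range(n)) * h


p_point = prob(0.5, 0.5)   # A = {1/2}
p_tail = prob(1.0, 50.0)   # A = [1, inf), truncated at 50 (density is 0 there)
p_half = prob(0.0, 0.5)    # A = [0, 1/2]
print(p_point, p_tail, p_half)
```

A single point has probability zero because $P$ has a density, which is exactly what domination by Lebesgue measure buys us.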

6 φ-divergences
We can measure how different two points are by taking their distance $\|x - y\|_2$. How can we measure the distance between two distributions (two probability measures)?
\[ D(P, Q) = \int \phi\!\left(\frac{dP}{dQ}\right) dQ = \int_X \phi\!\left(\frac{p(z)}{q(z)}\right) q(z)\,dz \tag{1} \]
where $\phi$ is a convex function with $\phi(1) = 0$, $0\,\phi(a/0) := a \lim_{t \to \infty} \phi(t)/t$, and $0\,\phi(0/0) := 0$.
Note that $P$ must be dominated by $Q$ (denoted $P \ll Q$) or the divergence is infinity.
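For discrete distributions the integral in (1) becomes the sum $\sum_i q_i\,\phi(p_i/q_i)$. The following is a minimal sketch of that sum under the two conventions stated above; the function names and the `slope_at_inf` parameter (standing in for $\lim_{t\to\infty}\phi(t)/t$) are assumptions made for this illustration, not part of the talk.

```python
import math


def phi_divergence(p, q, phi, slope_at_inf=math.inf):
    """Discrete phi-divergence sum_i q_i * phi(p_i / q_i).

    slope_at_inf stands for lim_{t->inf} phi(t)/t, used when q_i = 0 < p_i.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        if qi > 0:
            total += qi * phi(pi / qi)
        elif pi > 0:
            total += pi * slope_at_inf  # convention 0*phi(a/0) := a*lim phi(t)/t
        # pi == qi == 0 contributes 0 by the convention 0*phi(0/0) := 0
    return total


def phi_kl(t):
    """Generator t*log(t) of the Kullback-Leibler divergence (0 log 0 = 0)."""
    return t * math.log(t) if t > 0 else 0.0


def phi_var(t):
    """Generator |t - 1| of the variation distance."""
    return abs(t - 1.0)


p, q = [0.5, 0.5], [0.25, 0.75]
kl_pq = phi_divergence(p, q, phi_kl)
var_pq = phi_divergence(p, q, phi_var, slope_at_inf=1.0)
kl_not_dominated = phi_divergence([0.5, 0.5], [1.0, 0.0], phi_kl)
print(kl_pq, var_pq, kl_not_dominated)
```

The last call puts mass where `q` has none, so the Kullback-Leibler value is infinite.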

7 Solving Robust Linear Optimization in discrete case
Consider the following problem, which is from Ben-Tal et al. (2013):
\[ \min_{w}\ c^\top w \tag{2a} \]
\[ \text{s.t.}\ (a + Bp)^\top w \le \beta \quad \forall p \in U \tag{2b} \]
$U \subseteq \mathbb{R}^m$ is our uncertainty region, and we make the problem robust by requiring that the constraint be fulfilled for all $p \in U$.

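8 Theorem for RLOs and φ-divergences
Theorem. Let $U = \{p \in \mathbb{R}^m \mid p \ge 0,\ Cp \le d,\ D(p, q) \le \rho\}$. Then the constraint in equation (2) can be replaced by the following constraint:
\[ a^\top w + d^\top \eta + \rho\lambda + \lambda \sum_{i=1}^m q_i\, \phi^*\!\left(\frac{b_i^\top w - c_i^\top \eta}{\lambda}\right) \le \beta \tag{3a} \]
\[ \eta \ge 0,\ \lambda \ge 0 \tag{3b} \]
Here $b_i$ and $c_i$ are the $i$-th columns of $B$ and $C$, and $\phi^*(s) = \sup_{t \ge 0}\{st - \phi(t)\}$ is the conjugate of $\phi$.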

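9 Table 4: Some φ-Divergence Examples with Their Conjugates and Adjoints

| Divergence | $\phi^*(s)$ | $\tilde{\phi}(t)$ | RCP |
|---|---|---|---|
| Kullback-Leibler | $e^s - 1$ | $\phi_b(t)$ | S.C. |
| Burg entropy | $-\log(1 - s)$, $s < 1$ | $\phi_{kl}(t)$ | S.C. |
| J-divergence | No closed form | $\phi_j(t)$ | S.C. |
| $\chi^2$-distance | $2 - 2\sqrt{1 - s}$, $s < 1$ | $\phi_{mc}(t)$ | CQP |
| Modified $\chi^2$-distance | $-1$ if $s < -2$; $s + s^2/4$ if $s \ge -2$ | $\phi_c(t)$ | CQP |
| Hellinger distance | $\frac{s}{1 - s}$, $s < 1$ | $\phi_h(t)$ | CQP |
| $\chi$-divergence of order $\theta > 1$ | $s + (\theta - 1)\left(\frac{\lvert s \rvert}{\theta}\right)^{\theta/(\theta - 1)}$ | $t^{1-\theta}\phi_{ca}^{\theta}(t)$ | CQP |
| Variation distance | $-1$ if $s \le -1$; $s$ if $-1 \le s \le 1$ | $\phi_v(t)$ | LP |
| Cressie-Read | $\frac{1}{\theta}\left(1 - s(1 - \theta)\right)^{\theta/(\theta - 1)} - \frac{1}{\theta}$, $s < \frac{1}{1 - \theta}$ | $\phi_{cr}^{1-\theta}(t)$ | CQP |

Notes. The last column indicates the tractability of the robust counterpart (numbered (1) in the source paper). S.C., admits self-concordant barrier.
Figure 1: This table is from Ben-Tal et al. (2013).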
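A few conjugates from the table can be sanity-checked numerically from the definition $\phi^*(s) = \sup_{t \ge 0}\{st - \phi(t)\}$. The sketch below is a crude grid search, not a method from the talk; the generators are assumed to be the standard ones, $t\log t - t + 1$ for Kullback-Leibler, $(t-1)^2$ for the modified $\chi^2$-distance and $(\sqrt{t}-1)^2$ for the Hellinger distance, and the grid bounds are arbitrary.

```python
import math


def conjugate(phi, s, t_max=10.0, step=1e-3):
    """Approximate phi*(s) = sup_{t >= 0} (s*t - phi(t)) on a grid of t values."""
    n = int(round(t_max / step))
    return max(s * (k * step) - phi(k * step) for k in range(n + 1))


def phi_kl(t):
    """Kullback-Leibler generator t*log(t) - t + 1, with phi(0) = 1."""
    return t * math.log(t) - t + 1.0 if t > 0 else 1.0


def phi_mchi2(t):
    """Modified chi-squared generator (t - 1)^2."""
    return (t - 1.0) ** 2


def phi_hellinger(t):
    """Hellinger generator (sqrt(t) - 1)^2."""
    return (math.sqrt(t) - 1.0) ** 2


kl_conj = conjugate(phi_kl, 0.5)            # table: e^s - 1
mchi2_conj = conjugate(phi_mchi2, 1.0)      # table: s + s^2/4
mchi2_low = conjugate(phi_mchi2, -3.0)      # table: -1 for s < -2
hell_conj = conjugate(phi_hellinger, 0.5)   # table: s / (1 - s)
print(kl_conj, mchi2_conj, mchi2_low, hell_conj)
```

The grid only covers $t \in [0, 10]$, so the check is valid only for values of $s$ whose maximizer lies in that range.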

10 Proof of Theorem (1)
Our constraint can be turned into the following maximization problem:
\[ \beta \ge \max_{p \ge 0} \left\{ (a + Bp)^\top w \;\middle|\; Cp \le d,\ \sum_{i=1}^m q_i\, \phi\!\left(\frac{p_i}{q_i}\right) \le \rho \right\} \tag{4} \]
Then we can find the Lagrangian function $L$ and the dual objective function $g$:
\[ g(\lambda, \eta) = \max_{p \ge 0} L(p, \lambda, \eta) \tag{5} \]
\[ = \max_{p \ge 0} \left\{ (a + Bp)^\top w + \rho\lambda - \lambda \sum_{i=1}^m q_i\, \phi\!\left(\frac{p_i}{q_i}\right) + \eta^\top (d - Cp) \right\} \tag{6} \]
We just need to worry that $\min_{\lambda, \eta \ge 0} g(\lambda, \eta) \le \beta$.

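11 Proof of Theorem (2)
Now we can show the following, substituting $t = p_i / q_i$ in the third step:
\[ g(\lambda, \eta) = a^\top w + d^\top \eta + \rho\lambda + \max_{p \ge 0} \sum_{i=1}^m \left( p_i (b_i^\top w) - p_i (c_i^\top \eta) - \lambda q_i\, \phi(p_i / q_i) \right) \]
\[ = a^\top w + d^\top \eta + \rho\lambda + \sum_{i=1}^m \max_{p_i \ge 0} \left( p_i (b_i^\top w - c_i^\top \eta) - \lambda q_i\, \phi(p_i / q_i) \right) \]
\[ = a^\top w + d^\top \eta + \rho\lambda + \sum_{i=1}^m \lambda q_i \max_{t \ge 0} \left( t\, \frac{b_i^\top w - c_i^\top \eta}{\lambda} - \phi(t) \right) \]
\[ = a^\top w + d^\top \eta + \rho\lambda + \sum_{i=1}^m \lambda q_i\, \phi^*\!\left( \frac{b_i^\top w - c_i^\top \eta}{\lambda} \right) \]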

12 Corollary
Theorem. If we are only concerned with probability vectors, then we can reduce the constraint in equation (2) to the following:
\[ a^\top w + \eta + \rho\lambda + \lambda \sum_{i=1}^m q_i\, \phi^*\!\left(\frac{b_i^\top w - \eta}{\lambda}\right) \le \beta \tag{7a} \]
\[ \lambda \ge 0 \tag{7b} \]
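The corollary can be checked on a toy instance. The sketch below assumes the Kullback-Leibler generator $\phi(t) = t\log t - t + 1$, whose conjugate is $e^s - 1$; minimizing the left side of (7a) over $\eta$ in closed form then leaves $\rho\lambda + \lambda \log \sum_i q_i e^{b_i^\top w/\lambda}$ (with $a^\top w$ dropped), a standard calculation that is not shown on the slides. The two-scenario data and all function names are hypothetical. The brute-force worst case over the ambiguity set should match the minimum of the dual expression over $\lambda$.

```python
import math

# Hypothetical two-scenario instance: v[i] plays the role of b_i^T w.
v = [1.0, 3.0]
q = [0.6, 0.4]   # nominal probabilities
rho = 0.05       # radius of the KL ball


def kl(p, q):
    """KL divergence sum_i p_i log(p_i / q_i), with 0 log 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def primal_worst_case(n=100_000):
    """Brute force: scan p = (p1, 1 - p1) and keep the best feasible value."""
    best = -math.inf
    for k in range(n + 1):
        p = [k / n, 1.0 - k / n]
        if kl(p, q) <= rho:
            best = max(best, p[0] * v[0] + p[1] * v[1])
    return best


def dual_value(lam):
    """rho*lam + lam*log(sum_i q_i e^{v_i/lam}), i.e. (7a) minimized over eta."""
    vmax = max(v)  # shift the exponent for numerical stability
    s = sum(qi * math.exp((vi - vmax) / lam) for qi, vi in zip(q, v))
    return rho * lam + vmax + lam * math.log(s)


def dual_worst_case(n=2000):
    """Minimize over lambda on a log-spaced grid from 1e-2 to 1e3."""
    return min(dual_value(10.0 ** (-2 + 5 * k / n)) for k in range(n + 1))


primal = primal_worst_case()
dual = dual_worst_case()
print(primal, dual)
```

Both numbers exceed the nominal value $q^\top v = 1.8$, which is the price of robustness for this radius.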

13 Application (1)
Consider the robust newsvendor model (this is also from Ben-Tal et al. (2013)):
\[ \max_{Q} \min_{p \in U} \sum_{i=1}^m p_i\, u(r(Q, i)) \tag{8} \]
\[ \text{s.t.}\ r(Q, i) = v \min(d_i, Q) + s (Q - d_i)^+ - l (d_i - Q)^+ - cQ \tag{9} \]
We only have the historical sample frequencies $q$, so $U = \{p \in \mathbb{R}^m \mid e^\top p = 1,\ D(p, q) \le \rho\}$.

14 Application (2)
After applying the theorem and adjusting some things we get
\[ \max_{Q, \eta, \lambda} \left\{ -\eta - \rho\lambda - \lambda \sum_{i=1}^m q_i\, \phi^*\!\left( \frac{-u(r(Q, d_i)) - \eta}{\lambda} \right) \right\} \tag{10} \]
If $u(x)$ is concave and non-decreasing and $v + l \ge s$ (wrong in paper), then the problem is still convex.
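The role of the condition on $v$, $l$ and $s$ can be seen directly from the profit function. The sketch below assumes the usual newsvendor form $r(Q, d) = v\min(d, Q) + s(Q - d)^+ - l(d - Q)^+ - cQ$ and tests concavity in $Q$ through second differences; the numeric values and the helper names are made up for illustration. At the kink $Q = d$ the second difference equals $s - (v + l)$, so concavity holds exactly when $v + l \ge s$.

```python
def profit(Q, d, v, s, l, c):
    """r(Q, d) = v*min(d, Q) + s*(Q - d)^+ - l*(d - Q)^+ - c*Q."""
    return v * min(d, Q) + s * max(Q - d, 0) - l * max(d - Q, 0) - c * Q


def is_concave_in_Q(d, v, s, l, c, q_max=40):
    """Check r(Q+1) - 2 r(Q) + r(Q-1) <= 0 for integer order quantities Q."""
    return all(
        profit(Q + 1, d, v, s, l, c)
        - 2 * profit(Q, d, v, s, l, c)
        + profit(Q - 1, d, v, s, l, c) <= 1e-12
        for Q in range(1, q_max)
    )


ok = is_concave_in_Q(d=20, v=10, s=2, l=4, c=5)    # v + l >= s
bad = is_concave_in_Q(d=20, v=10, s=20, l=4, c=5)  # v + l < s
print(ok, bad)
```

Composing a concave $r$ with a concave, non-decreasing $u$ keeps the objective concave in $Q$, which is why both conditions appear together.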

15 Ambiguity and SVM
- For most applications of SVM, the uncertainty in the data is continuous (w.r.t. Lebesgue measure) and not discrete.
- The divergence between a continuous and a discrete distribution is always infinity.
- Maybe we can just get probabilities in the nominal distribution and then turn the discrete nominal distribution into a continuous one.
- Let's try the same strategy as Ben-Tal et al. (2013) but with continuous distributions.

16 SVM with Kantorovich metric
The Kantorovich metric $\rho$ is defined as follows:
\[ \rho(P, Q) := \inf_{K} \left\{ \int_{X \times X} d(x, y)\, K(dx, dy) \;\middle|\; \text{the marginals of } K \text{ are } P \text{ and } Q \right\} \tag{11} \]
Now the distance between a continuous and a discrete distribution is not infinity.
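On the real line with $d(x, y) = |x - y|$, the infimum in (11) has a well-known closed form: the integral of the absolute difference of the two distribution functions. The sketch below uses that shortcut for discrete distributions only; the function name and the example points are illustrative assumptions. Two point masses at different locations are mutually singular, so a divergence such as Kullback-Leibler is infinite, while their Kantorovich distance is simply how far the mass has to move.

```python
def kantorovich_1d(xs, ps, ys, qs):
    """Kantorovich distance between two discrete distributions on the real
    line with d(x, y) = |x - y|, computed as the integral of |F_P - F_Q|."""
    mass = {}
    for x, p in zip(xs, ps):
        mass[x] = mass.get(x, 0.0) + p
    for y, q in zip(ys, qs):
        mass[y] = mass.get(y, 0.0) - q
    points = sorted(mass)
    total, cdf_gap = 0.0, 0.0
    for left, right in zip(points, points[1:]):
        cdf_gap += mass[left]                    # F_P - F_Q on [left, right)
        total += abs(cdf_gap) * (right - left)
    return total


dist_points = kantorovich_1d([0.0], [1.0], [1.0], [1.0])          # delta_0 vs delta_1
dist_split = kantorovich_1d([0.0, 1.0], [0.5, 0.5], [0.5], [1.0])  # two halves vs middle
print(dist_points, dist_split)
```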

17 Semi-infinite DR SVM
The DR-SVM with the uncertainty set $U$ defined by the Kantorovich metric,
\[ \min_{w, b}\ \frac{1}{2} w^\top w + C \sup_{P \in U} \int \left( 1 - y(x^\top w - b) \right) P(dx, dy), \tag{12} \]
is equivalent to (Lee and Mehrotra)
\[ \min_{w, b, t, u}\ \frac{1}{2} w^\top w + C \left( \frac{1}{m} \sum_{j=1}^m \xi_j + \eta u \right) \]
\[ \text{s.t.}\ \xi_j \ge 1 - y(x^\top w - b) - [d(x, x_j) + d_y(y, y_j)]\,u, \tag{13a} \]
\[ \forall (x, y) \in X,\ \forall j \tag{13b} \]
\[ t, u \ge 0 \tag{13c} \]

18 Sketch of proof
We just consider the second part of (12) and use a fact from Shapiro (2001) that the following are the primal and dual of our problem.
Primal:
\[ \max_{P \in U}\ E_P[f(x, y)] \quad \text{s.t.}\ E_P[g_i(x, y)] = b_i,\ i = 1, \dots, t \]
Dual:
\[ \min_{\xi \ge 0}\ b^\top \xi \quad \text{s.t.}\ \sum_{i=1}^t \xi_i g_i(x, y) - f(x, y) \in C \]
where $C = \{ f : \int_X f(x, y)\, P(dx, dy) \ge 0,\ \forall P \in U \}$.

19 KL Divergence
Let $\phi(x) = x \log(x)$. This gives us the Kullback-Leibler divergence:
\[ D(P, P_0) = \int_X p(z) \log\!\left( \frac{p(z)}{p_0(z)} \right) dz \tag{14} \]
We would like to use this to bound, in a robust way, the true probability $P$ with the nominal probability $P_0$, which is given by the data.
Define $U = \{P : D(P, P_0) \le \eta\}$.

20 Result
Take our original problem and make it robust:
\[ \min\ h(w) \quad \text{s.t.}\ \min_{P \in U} \Pr_P\{H(w, x) \le 0\} \ge 1 - \epsilon \]
Then transform it into a tractable problem (Hu and Hong (2013)) where there is no ambiguity:
\[ \min\ h(w) \quad \text{s.t.}\ \Pr_{P_0}\{H(w, x) \le 0\} \ge 1 - \bar{\epsilon} \]
where
\[ \bar{\epsilon} = \sup_{t > 0} \frac{e^{-\eta}(t + 1)^{\epsilon} - 1}{t}. \]
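The adjusted level $\bar{\epsilon} = \sup_{t>0} \left(e^{-\eta}(t+1)^{\epsilon} - 1\right)/t$ has no closed form in general, but it is a one-dimensional supremum and is easy to approximate. The sketch below is a plain grid search on a logarithmic scale; the function name, grid range and sample values of $\epsilon$ and $\eta$ are assumptions for illustration. With $\eta = 0$ the supremum is approached as $t \to 0$ and recovers $\epsilon$; with $\eta > 0$ the level becomes strictly smaller, so the nominal chance constraint gets tighter.

```python
import math


def eps_bar(eps, eta, n_grid=4000):
    """Approximate sup_{t>0} (e^{-eta} (t+1)^eps - 1) / t on a log-spaced grid."""
    best = -math.inf
    for k in range(n_grid + 1):
        t = 10.0 ** (-8 + 16 * k / n_grid)  # t ranges from 1e-8 to 1e8
        # e^{-eta}(t+1)^eps - 1 = expm1(eps*log(1+t) - eta); accurate for small t
        val = math.expm1(eps * math.log1p(t) - eta) / t
        best = max(best, val)
    return best


no_ambiguity = eps_bar(0.05, 0.0)     # eta = 0: should be close to eps itself
with_ambiguity = eps_bar(0.05, 0.01)  # eta > 0: strictly smaller level
print(no_ambiguity, with_ambiguity)
```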

21 Proof (1)
Consider the following problem:
\[ \min_{w \in W} \max_{P \in U} E_P[H(w, x)] \]
By multiplying by $\frac{p_0(x)}{p_0(x)}$ and letting $L(x) = \frac{p(x)}{p_0(x)}$, we can change the inner maximization problem to look like this:
\[ \max_{L \in \mathcal{L}}\ E_{P_0}[H(w, x) L(x)] \quad \text{s.t.}\ E_{P_0}[L \log(L)] \le \eta, \]
where $\mathcal{L} = \{L \mid E_{P_0}[L] = 1,\ L \ge 0 \text{ a.s.}\}$.

22 Proof (2)
\[ l(\alpha, L) = E_{P_0}[H(w, x) L(x)] - \alpha \left( E_{P_0}[L(x) \log(L(x))] - \eta \right) \]
If we maximize this over $L \in \mathcal{L}$ and then take the minimum over $\alpha \ge 0$, we solve the dual, and by the convexity of the problem the primal and dual values are equal. The inner problem is
\[ \max_{L}\ E_{P_0}[H(w, x) L(x) - \alpha L(x) \log(L(x))] \quad \text{s.t.}\ E_{P_0}[L(x)] = 1,\ L(x) \ge 0. \]

23 Proof (3)
Now we define the functionals $J(f) = E_{P_0}[H(w, x) f(x) - \alpha f(x) \log(f(x))]$ and $J_c(f) = E_{P_0}[f(x)] - 1$. We then set up an unconstrained optimization problem for these functionals and solve it. After some functional analysis you get
\[ L^*(x) = \frac{e^{H(w, x)/\alpha}}{E_{P_0}[e^{H(w, x)/\alpha}]} \tag{15} \]
Plug that back into $l(\alpha, L)$ and you get
\[ l(\alpha, L^*) = v(\alpha) = \alpha \log\!\left( E_{P_0}[e^{H(w, x)/\alpha}] \right) + \alpha\eta. \]
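The closed form can be verified on a finite sample, taking $P_0$ to be the empirical distribution. The sketch below builds the exponentially tilted likelihood ratio from (15), checks that it averages to one under $P_0$, and compares the Lagrangian $l(\alpha, L^*)$ evaluated directly with $v(\alpha) = \alpha \log E_{P_0}[e^{H/\alpha}] + \alpha\eta$. The sample of $H$ values, the choices of $\alpha$ and $\eta$, and the helper names are all made up for illustration.

```python
import math


def tilt(h_values, alpha):
    """L*(x_k) = e^{H_k/alpha} / E_P0[e^{H/alpha}] under the empirical P0."""
    m = max(h_values)  # shift exponents for numerical stability
    w = [math.exp((h - m) / alpha) for h in h_values]
    mean_w = sum(w) / len(w)
    return [wi / mean_w for wi in w]


def v(h_values, alpha, eta):
    """v(alpha) = alpha * log E_P0[e^{H/alpha}] + alpha * eta."""
    m = max(h_values)
    mean_exp = sum(math.exp((h - m) / alpha) for h in h_values) / len(h_values)
    return m + alpha * math.log(mean_exp) + alpha * eta


def lagrangian(h_values, L, alpha, eta):
    """l(alpha, L) = E[H L] - alpha * (E[L log L] - eta) under the empirical P0."""
    n = len(h_values)
    mean_hl = sum(h * li for h, li in zip(h_values, L)) / n
    mean_llogl = sum(li * math.log(li) for li in L) / n
    return mean_hl - alpha * (mean_llogl - eta)


H = [-1.0, 0.5, 2.0, 0.0]  # hypothetical sample values of H(w, x)
alpha, eta = 1.5, 0.1
L_star = tilt(H, alpha)
mean_L = sum(L_star) / len(L_star)
direct = lagrangian(H, L_star, alpha, eta)
closed_form = v(H, alpha, eta)
print(mean_L, direct, closed_form)
```

The tilt puts more weight on sample points with large $H$, which is how the worst-case distribution inflates the expectation.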

24 Proof (4)
Now just notice that $\Pr_{P_0}(H(w, x) \le 0) = E_{P_0}[\mathbf{1}_{\{H(w, x) \le 0\}}(x)]$, plug it into the above solution, and you get the constraint from before:
\[ \Pr_{P_0}\{H(w, x) \le 0\} \ge 1 - \bar{\epsilon}, \qquad \bar{\epsilon} = \sup_{t > 0} \frac{e^{-\eta}(t + 1)^{\epsilon} - 1}{t}. \]

25 References
- Aharon Ben-Tal, Dick den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2), 2013.
- Zhaolin Hu and L. Jeff Hong. Kullback-Leibler divergence constrained distributionally robust optimization. Available at Optimization Online, 2013.
- Changhyeok Lee and Sanjay Mehrotra. A distributionally-robust approach for finding support vector machines. Not yet published.
- Alexander Shapiro. On duality theory of conic linear problems. 2001.
