Ambiguity Sets and their applications to SVM


Ammon Washburn, University of Arizona
April 22, 2016

Introduction

- Go over some (very little) set theory
- Explain what φ-divergences are
- Apply them to Support Vector Machines

Measures and Probability Measures

A measure space consists of three things, (X, \mathcal{X}, \mu): a set, a σ-algebra of its subsets, and a measure. After several weeks deep into measure theory you realize power sets are bad, so for X = \mathbb{R} we just take \mathcal{X} to be the Borel sets and \mu to be Lebesgue measure. A probability measure P is a measure with total mass one.

Overview of Important Concepts in Probability Theory

For Lebesgue measure we can use integrals: \mu(A) = \int_A dx. If a probability measure P is absolutely continuous with respect to (dominated by) Lebesgue measure, then there exists a function p(x) such that P(A) = \int_A p(x)\,dx. The function p(x) is called the density of P, and by the properties of P we know that \int_{\mathbb{R}} p(x)\,dx = 1.

Abstractly, if P is dominated by Q then there exists a function \frac{dP}{dQ}(x) (the Radon-Nikodym derivative) such that P(A) = \int_A \frac{dP}{dQ}(x)\,dQ. In other words, we only have to worry about the measure Q and then find the special function that makes it work.

Examples

Suppose we have a probability measure that is dominated by Lebesgue measure and has density 2x\,1_{[0,1]}(x).

- What is the probability of the set A = \{1/2\}?
- What is the probability of the set A = [1, \infty)?
- What is the probability of the set A = [0, 1/2]?
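A quick numerical check of these three answers (0, 0, and 1/4), written as a small Python sketch; the code is only an illustration and is not part of the talk.

    # Numerical check of the three questions above, for the density
    # p(x) = 2x on [0, 1] and 0 elsewhere.
    from scipy.integrate import quad

    def density(x):
        return 2.0 * x if 0.0 <= x <= 1.0 else 0.0

    # P({1/2}): a single point has Lebesgue measure zero, so the integral is 0.
    p_singleton, _ = quad(density, 0.5, 0.5)

    # P([1, inf)): the density vanishes outside [0, 1], so this is 0 as well.
    p_tail, _ = quad(density, 1.0, 10.0)   # any finite upper limit works here

    # P([0, 1/2]) = int_0^{1/2} 2x dx = 1/4.
    p_half, _ = quad(density, 0.0, 0.5)

    print(p_singleton, p_tail, p_half)     # approximately 0.0, 0.0, 0.25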

φ-divergences

We can measure how different two points are by taking their distance \|x - y\|_2. How can we measure the distance between two distributions (two probability measures)?

    D(P, Q) = \int_X \phi\!\left(\frac{dP}{dQ}\right) dQ = \int_X \phi\!\left(\frac{p(z)}{q(z)}\right) q(z)\,dz    (1)

where \phi is a convex function with \phi(1) = 0, 0\,\phi(a/0) \equiv a \lim_{t\to\infty} \phi(t)/t, and 0\,\phi(0/0) \equiv 0. Note that P must be dominated by Q (denoted P \ll Q) or the divergence is infinite.
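As a small sketch (my own, not from the talk): with \phi(t) = t \log t, definition (1) recovers the Kullback-Leibler divergence. For discrete P and Q the integral against Q becomes a sum weighted by q, which is easy to check numerically; the distributions below are made up.

    import numpy as np
    from scipy.stats import entropy

    def phi_divergence(p, q, phi):
        # D(P, Q) = sum_i q_i * phi(p_i / q_i), the discrete version of (1)
        return np.sum(q * phi(p / q))

    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.4, 0.4, 0.2])

    phi_kl = lambda t: t * np.log(t)          # phi(1) = 0, convex on t > 0
    print(phi_divergence(p, q, phi_kl))       # sum_i q_i * (p_i/q_i) * log(p_i/q_i)
    print(entropy(p, q))                      # same value: KL(P || Q)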

Solving Robust Linear Optimization in the discrete case

Consider the following problem, from Ben-Tal et al. (2013):

    \min_w \; c^\top w    (2a)
    s.t. \; (a + Bp)^\top w \le \beta \quad \forall p \in U    (2b)

U \subseteq \mathbb{R}^m is our uncertainty region, and we make the problem robust by requiring that the constraint be fulfilled for all p \in U.

Theorem for RLOs and φ-divergences

Theorem. Let U = \{p \in \mathbb{R}^m : p \ge 0,\; Cp \le d,\; D(p, q) \le \rho\}. Then the constraint in equation (2) can be replaced by the following constraints:

    a^\top w + d^\top \eta + \rho\lambda + \lambda \sum_{i=1}^m q_i\, \phi^*\!\left(\frac{b_i^\top w - c_i^\top \eta}{\lambda}\right) \le \beta    (3a)
    \eta \ge 0, \quad \lambda \ge 0    (3b)

where \phi^*(s) = \sup_{t \ge 0}\{ts - \phi(t)\} is the conjugate of \phi, b_i is the i-th column of B, and c_i is the i-th column of C.

Some φ-divergence examples with their conjugates and adjoints:

    Divergence                 Conjugate \phi^*(s)                          Adjoint        RCP
    Kullback-Leibler           e^s - 1                                      \phi_b(t)      S.C.
    Burg entropy               -\log(1-s),  s < 1                           \phi_{kl}(t)   S.C.
    J-divergence               no closed form                               \phi_j(t)      S.C.
    \chi^2-distance            2 - 2\sqrt{1-s},  s < 1                      \phi_{mc}(t)   CQP
    Modified \chi^2-distance   -1 for s < -2;  s + s^2/4 for s \ge -2       \phi_c(t)      CQP
    Hellinger distance         s/(1-s),  s < 1                              \phi_h(t)      CQP
    Variation distance         -1 for s \le -1;  s for -1 \le s \le 1       \phi_v(t)      LP

(The table also lists the \chi-divergence of order \theta > 1 and the Cressie-Read divergence.)

Notes: the adjoint of \phi is \tilde{\phi}(t) = t\,\phi(1/t). The last column indicates the tractability of the resulting robust counterpart: LP (linear program), CQP (conic quadratic program), or S.C. (admits a self-concordant barrier).

Figure 1: this table is from Ben-Tal et al. (2013).

Proof of Theorem (1)

Our constraint can be turned into the following maximization problem:

    \beta \ge \max_{p \ge 0} \left\{ (a + Bp)^\top w \;:\; Cp \le d,\; \sum_{i=1}^m q_i \phi\!\left(\frac{p_i}{q_i}\right) \le \rho \right\}    (4)

Then we can form the Lagrangian function L and the dual objective function g; we just need to require that \min_{\lambda \ge 0, \eta \ge 0} g(\lambda, \eta) \le \beta:

    g(\lambda, \eta) = \max_{p \ge 0} \left\{ (a + Bp)^\top w + \rho\lambda - \lambda \sum_{i=1}^m q_i \phi\!\left(\frac{p_i}{q_i}\right) + \eta^\top (d - Cp) \right\}    (5)
                     = \max_{p \ge 0} \; L(p, \lambda, \eta)    (6)

Proof of Theorem (2)

Now we can show the following:

    g(\lambda, \eta) = a^\top w + d^\top \eta + \rho\lambda + \max_{p \ge 0} \sum_{i=1}^m \left( p_i (b_i^\top w) - p_i (c_i^\top \eta) - \lambda q_i \phi(p_i/q_i) \right)
                     = a^\top w + d^\top \eta + \rho\lambda + \sum_{i=1}^m \max_{p_i \ge 0} \left( p_i (b_i^\top w - c_i^\top \eta) - \lambda q_i \phi(p_i/q_i) \right)
                     = a^\top w + d^\top \eta + \rho\lambda + \sum_{i=1}^m \lambda q_i \max_{t \ge 0} \left( t (b_i^\top w - c_i^\top \eta)/\lambda - \phi(t) \right)
                     = a^\top w + d^\top \eta + \rho\lambda + \sum_{i=1}^m \lambda q_i\, \phi^*\!\left(\frac{b_i^\top w - c_i^\top \eta}{\lambda}\right)
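As a concrete instance of the conjugate \phi^* appearing in the last line (and in the table above), take the modified \chi^2-distance with the standard generator \phi(t) = (t-1)^2 (the generator itself is not shown on the slide; the routine computation below reproduces the table entry):

    \phi^*(s) = \sup_{t \ge 0} \{ t s - (t-1)^2 \}.

Setting the derivative to zero gives t^* = 1 + s/2, which satisfies t^* \ge 0 exactly when s \ge -2, and then

    \phi^*(s) = s\left(1 + \tfrac{s}{2}\right) - \tfrac{s^2}{4} = s + \tfrac{s^2}{4}.

For s < -2 the supremum is attained at t = 0, giving \phi^*(s) = -1.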

Corollary

If we are just concerned with probability vectors, i.e. U = \{p \in \mathbb{R}^m : p \ge 0,\; e^\top p = 1,\; D(p, q) \le \rho\}, then we can reduce the constraint in equation (2) to

    a^\top w + \eta + \rho\lambda + \lambda \sum_{i=1}^m q_i\, \phi^*\!\left(\frac{b_i^\top w - \eta}{\lambda}\right) \le \beta    (7a)
    \lambda \ge 0    (7b)

with \eta now a free scalar (it is the multiplier of the equality constraint e^\top p = 1).
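To see (7a) in action, here is a small numerical sketch (my own, not from the talk) that compares the worst-case expectation \max\{\ell^\top p : p \ge 0,\; e^\top p = 1,\; D(p,q) \le \rho\} computed directly against the dual value from (7a), for the Kullback-Leibler case. For KL, minimizing (7a) over \eta has a closed form, leaving the one-dimensional problem \min_{\lambda \ge 0} \rho\lambda + \lambda \log \sum_i q_i e^{\ell_i/\lambda}. The data below are made up.

    import numpy as np
    from scipy.optimize import minimize, minimize_scalar
    from scipy.special import logsumexp, xlogy

    rng = np.random.default_rng(0)
    m = 6
    q = rng.dirichlet(np.ones(m))          # nominal probability vector
    ell = rng.uniform(0.0, 1.0, size=m)    # ell_i plays the role of b_i^T w
    rho = 0.05                             # divergence budget

    def kl(p, q):
        # D(p, q) = sum_i q_i * phi(p_i/q_i) with phi(t) = t log t  ( = KL(p || q) )
        p = np.clip(p, 1e-12, None)        # guard against tiny negatives from the solver
        return float(np.sum(xlogy(p, p / q)))

    # Primal: worst-case expectation over the KL ball intersected with the simplex.
    primal = minimize(
        lambda p: -p @ ell, x0=q, method="SLSQP",
        bounds=[(0.0, 1.0)] * m,
        constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0},
                     {"type": "ineq", "fun": lambda p: rho - kl(p, q)}],
    )
    worst_case = -primal.fun

    # Dual from (7a): for KL the minimization over eta has a closed form, leaving
    #   min_{lambda >= 0}  rho*lambda + lambda * log sum_i q_i exp(ell_i / lambda).
    dual = minimize_scalar(
        lambda lam: rho * lam + lam * logsumexp(ell / lam, b=q),
        bounds=(1e-6, 1e3), method="bounded",
    )

    print(worst_case, dual.fun)   # the two values should agree to a few decimals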

Application (1)

Consider the robust newsvendor model (this is also from Ben-Tal et al. (2013)):

    \max_Q \; \min_{p \in U} \; \sum_{i=1}^m p_i\, u(r(Q, d_i))    (8)
    where \; r(Q, d_i) = v \min(d_i, Q) + s (Q - d_i)_+ - l (d_i - Q)_+ - c Q    (9)

(with v the selling price, s the salvage value, l the shortage penalty, and c the unit cost). We just have the historical sample frequencies q, so U = \{p \in \mathbb{R}^m : e^\top p = 1,\; D(p, q) \le \rho\}.

Application (2)

After applying the theorem and adjusting some things we get

    \max_{Q, \eta, \lambda \ge 0} \left\{ \eta - \rho\lambda - \lambda \sum_{i=1}^m q_i\, \phi^*\!\left(\frac{\eta - u(r(Q, d_i))}{\lambda}\right) \right\}    (10)

If u(x) is concave and non-decreasing and v + l \ge s (the condition is stated incorrectly in the paper), then the problem is still convex (a concave maximization).
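A small numerical sketch (my own illustration, not from the talk or the paper) of what (10) buys you in the KL case with linear utility u(x) = x: for each order quantity Q the inner dual over (\eta, \lambda) collapses, as before, to a one-dimensional log-sum-exp problem in \lambda, and the robust order quantity is then found by a scalar search over Q. All parameter values below are made up.

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.special import logsumexp

    # Made-up newsvendor data: demand scenarios d with empirical frequencies q.
    d = np.array([5.0, 10.0, 15.0, 20.0])
    q = np.array([0.2, 0.4, 0.3, 0.1])
    v, s, l, c = 10.0, 2.0, 1.0, 4.0       # price, salvage, shortage penalty, cost (v + l >= s)
    rho = 0.10                             # KL budget around the empirical frequencies

    def revenue(Q):
        # r(Q, d_i) = v*min(d_i, Q) + s*(Q - d_i)_+ - l*(d_i - Q)_+ - c*Q
        return v * np.minimum(d, Q) + s * np.maximum(Q - d, 0) - l * np.maximum(d - Q, 0) - c * Q

    def worst_case_expected_revenue(Q):
        # min_{p in U} sum_i p_i r(Q, d_i), via the dual with u(x) = x and the KL
        # divergence; eta is eliminated in closed form, leaving a 1-D problem:
        #   - min_{lambda >= 0}  rho*lambda + lambda * log sum_i q_i exp(-r_i / lambda)
        r = revenue(Q)
        res = minimize_scalar(lambda lam: rho * lam + lam * logsumexp(-r / lam, b=q),
                              bounds=(1e-6, 1e3), method="bounded")
        return -res.fun

    # Robust order quantity: maximize the worst-case expected revenue over Q.
    robust = minimize_scalar(lambda Q: -worst_case_expected_revenue(Q),
                             bounds=(d.min(), d.max()), method="bounded")

    # Nominal (non-robust) order quantity for comparison.
    nominal = minimize_scalar(lambda Q: -(q @ revenue(Q)),
                              bounds=(d.min(), d.max()), method="bounded")

    print("robust Q:", robust.x, " nominal Q:", nominal.x)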

Ambiguity and SVM

- For most applications of SVM, the uncertainty in the data is continuous (w.r.t. Lebesgue measure) and not discrete.
- The φ-divergence between a continuous and a discrete distribution is always infinite.
- Maybe we can just estimate probabilities for the nominal distribution and then turn the discrete nominal distribution into a continuous one.
- Let's try the same strategy as Ben-Tal et al. (2013) but with continuous distributions.

SVM with the Kantorovich metric

The Kantorovich metric \rho is defined as follows:

    \rho(P, Q) \equiv \inf_K \left\{ \int_{X \times X} d(x, y)\, K(dx, dy) \;:\; \text{the marginals of } K \text{ are } P \text{ and } Q \right\}    (11)

Now the distance between a continuous and a discrete distribution is not infinite.
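For intuition, the Kantorovich metric is the (first) Wasserstein distance, and in one dimension scipy can compute it between empirical distributions. The sketch below (my own illustration) shows a finite, small distance between a five-point discrete distribution and a large sample from a continuous uniform distribution, exactly the situation where a φ-divergence would be infinite.

    import numpy as np
    from scipy.stats import wasserstein_distance

    rng = np.random.default_rng(0)

    # A discrete (empirical) distribution supported on five points...
    discrete_support = np.array([0.1, 0.3, 0.5, 0.7, 0.9])

    # ...and a continuous Uniform[0, 1] distribution, represented by a large sample.
    continuous_sample = rng.uniform(0.0, 1.0, size=100_000)

    # Kantorovich / Wasserstein-1 distance: finite even though one distribution is
    # discrete and the other continuous.
    print(wasserstein_distance(discrete_support, continuous_sample))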

Semi-infinite DR-SVM

The DR-SVM, with the uncertainty set U defined via the Kantorovich metric, is

    \min_{w,b} \; \frac{1}{2} w^\top w + C \sup_{P \in U} \int \left(1 - y(x^\top w - b)\right)_+ P(dx, dy)    (12)

which is equivalent to (Lee and Mehrotra)

    \min_{w,b,\xi,u} \; \frac{1}{2} w^\top w + C \left( \frac{1}{m} \sum_{j=1}^m \xi_j + \eta u \right)    (13a)
    s.t. \; \xi_j \ge 1 - y(x^\top w - b) - \left[ d(x, x_j) + d_y(y, y_j) \right] u, \quad \forall (x, y) \in X,\; \forall j    (13b)
    \xi \ge 0, \; u \ge 0    (13c)

(here \eta is the radius of the Kantorovich ball defining U; the program is semi-infinite because constraint (13b) must hold for every point (x, y), not only the observed (x_j, y_j)).

Sketch of proof

We just consider the second part of (12) and use a fact from Shapiro (2001): the following pair is the primal and dual of our problem.

    \max_{P \in U} \; E_P[f(x, y)] \quad s.t. \; E_P[g_i(x, y)] = b_i, \; i = 1, \dots, t

    \min_{\xi} \; b^\top \xi \quad s.t. \; \sum_{i=1}^t \xi_i g_i(x, y) - f(x, y) \in C

where C = \left\{ f : \int_X f(x, y) P(dx, dy) \ge 0 \;\; \forall P \in U \right\}.

KL Divergence

Let \phi(x) = x \log(x). This gives us the Kullback-Leibler divergence:

    D(P, P_0) = \int_X p(z) \log\!\left(\frac{p(z)}{p_0(z)}\right) dz    (14)

We would like to use this to bound, in a robust way, the true probability P with the nominal probability P_0, which is given by the data. Define U = \{P : D(P, P_0) \le \eta\}.
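As a quick sanity check of (14) (my own illustration), the integral can be computed numerically for two Gaussian densities and compared against the well-known closed form KL(N(\mu_1,\sigma_1^2) \| N(\mu_2,\sigma_2^2)) = \log(\sigma_2/\sigma_1) + (\sigma_1^2 + (\mu_1-\mu_2)^2)/(2\sigma_2^2) - 1/2.

    import numpy as np
    from scipy.integrate import quad
    from scipy.stats import norm

    mu1, sig1 = 0.0, 1.0
    mu2, sig2 = 1.0, 2.0

    p = lambda z: norm.pdf(z, mu1, sig1)    # density of P
    p0 = lambda z: norm.pdf(z, mu2, sig2)   # density of P_0

    # Numerical value of (14): integral of p(z) * log(p(z) / p0(z)).
    kl_numeric, _ = quad(lambda z: p(z) * np.log(p(z) / p0(z)), -20, 20)

    # Closed form for two univariate Gaussians.
    kl_closed = np.log(sig2 / sig1) + (sig1**2 + (mu1 - mu2)**2) / (2 * sig2**2) - 0.5

    print(kl_numeric, kl_closed)   # both approximately 0.443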

Result

Take our original problem and make it robust:

    \min h(w) \quad s.t. \quad \min_{P \in U} \Pr_P\{H(w, x) \le 0\} \ge 1 - \epsilon

then transform it into a tractable problem with no ambiguity (Hu and Hong (2013)):

    \min h(w) \quad s.t. \quad \Pr_{P_0}\{H(w, x) \le 0\} \ge 1 - \bar{\epsilon}

where \bar{\epsilon} = \sup_{t > 0} \dfrac{e^{-\eta}(t+1)^{\epsilon} - 1}{t}.
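A tiny sketch (my own) of how the adjusted risk level \bar\epsilon behaves: evaluate the supremum over a log-spaced grid in t. With \eta = 0 it returns \epsilon itself (no ambiguity, no adjustment), and it shrinks as \eta grows.

    import numpy as np

    def adjusted_eps(eps, eta, num=2000):
        # bar_eps = sup_{t > 0} (exp(-eta) * (1 + t)**eps - 1) / t, via a grid search.
        t = np.logspace(-8, 8, num)
        return np.max((np.exp(-eta) * (1.0 + t) ** eps - 1.0) / t)

    print(adjusted_eps(0.05, 0.0))    # ~0.05: eta = 0 recovers the nominal risk level
    print(adjusted_eps(0.05, 0.01))   # smaller than 0.05: ambiguity tightens the constraint
    print(adjusted_eps(0.05, 0.10))   # smaller still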

Proof (1)

Consider the following problem:

    \min_{w \in W} \max_{P \in U} E_P[H(w, x)]

Multiplying the integrand by \frac{p_0(x)}{p_0(x)} and letting L(x) = \frac{p(x)}{p_0(x)} (the likelihood ratio), we can change the inner maximization problem to look like this:

    \max_{L \in \mathbb{L}} \; E_{P_0}[H(w, x) L(x)] \quad s.t. \; E_{P_0}[L \log(L)] \le \eta

where \mathbb{L} = \{L : E_{P_0}[L] = 1,\; L \ge 0 \text{ a.s.}\}.

Proof (2)

Form the Lagrangian

    l(\alpha, L) = E_{P_0}[H(w, x) L(x)] - \alpha \left( E_{P_0}[L(x) \log(L(x))] - \eta \right)

If we maximize this over L \in \mathbb{L} and then take the minimum over \alpha \ge 0, we solve the dual, and by the convexity of the problem the two are equal. The inner maximization is

    \max_L \; E_{P_0}[H(w, x) L(x) - \alpha L(x) \log(L(x))] \quad s.t. \; E_{P_0}[L(x)] = 1, \; L(x) \ge 0

Proof (3)

Now define the functionals J(f) = E_{P_0}[H(w, x) f(x) - \alpha f(x) \log(f(x))] and J_c(f) = E_{P_0}[f(x)] - 1, turn these into an unconstrained optimization over the functionals, and solve. After some functional analysis you get

    L^*(x) = \frac{e^{H(w,x)/\alpha}}{E_{P_0}[e^{H(w,x)/\alpha}]}    (15)

Plug that back into l(L, \alpha) and you get

    l(L^*, \alpha) = v(\alpha) = \alpha \log\left( E_{P_0}[e^{H(w,x)/\alpha}] \right) + \alpha\eta

Proof (4)

Now just notice that \Pr_P\{H(w, x) > 0\} = E_P[1_{\{H(w,x) > 0\}}(x)]. Plugging this indicator function in place of H in the solution above, and requiring \min_{\alpha > 0} v(\alpha) \le \epsilon, works out to the constraint from before:

    \Pr_{P_0}\{H(w, x) \le 0\} \ge 1 - \bar{\epsilon}, \qquad \bar{\epsilon} = \sup_{t > 0} \frac{e^{-\eta}(t + 1)^{\epsilon} - 1}{t}

References

Aharon Ben-Tal, Dick den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341-357, 2013.

Zhaolin Hu and L. Jeff Hong. Kullback-Leibler divergence constrained distributionally robust optimization. Available at Optimization Online, 2013.

Changhyeok Lee and Sanjay Mehrotra. A distributionally-robust approach for finding support vector machines. Not yet published.

Alexander Shapiro. On duality theory of conic linear problems. Pages 135-165, 2001.