Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs

Ammon Washburn, University of Arizona, September 25, 2015

Introduction
- We begin with basic support vector machines (SVMs), also known as maximum margin algorithms.
- We introduce missing or uncertain values into the training data, which leads to chance-constrained programs (CCPs).
- We reformulate the CCPs as second-order cone programs (SOCPs) that can be solved under different kinds of available information.
- We introduce sparse SVMs and explain why they are used.
- We take a slight digression on ν-SVMs.
- We close with future research areas.

SVMs and MM programs
The basic (linear) maximum margin (MM) program is defined by the following optimization problem:
\[
\min_{w,b,\xi} \ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i
\quad \text{s.t.} \quad y_i(w^\top x_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,m
\]
- This program finds a hyperplane between the two groups of data points and uses it to categorize new data.
- $\xi_i$ is the penalty paid when data point $i$ would have to be moved to reach the correct side of the margin, and $C$ is a heuristic constant that weights this penalty.
- $\|w\|_2$ and the margin between the groups are inversely related: the margin width is $2/\|w\|_2$.
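As a rough illustration of the program above, the following sketch evaluates the objective for a fixed hyperplane by giving each point the smallest feasible slack, $\xi_i = \max(0, 1 - y_i(w^\top x_i - b))$. The function name, the toy data, and the sign convention $w^\top x - b$ are assumptions made for this example.

```python
def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slacks) for a fixed hyperplane w.x - b = 0.

    Each slack is the smallest xi_i that satisfies
    y_i * (w.x_i - b) >= 1 - xi_i with xi_i >= 0.
    """
    slacks = []
    for x_i, y_i in zip(X, y):
        score = sum(w_j * x_j for w_j, x_j in zip(w, x_i)) - b
        slacks.append(max(0.0, 1.0 - y_i * score))
    objective = 0.5 * sum(w_j * w_j for w_j in w) + C * sum(slacks)
    return objective, slacks


# Toy data (hypothetical): the second point lies inside the margin.
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0]]
y = [1, 1, -1]
objective, slacks = soft_margin_objective([1.0, 0.0], 0.0, X, y, C=10.0)
print(objective, slacks)
```

Only the point inside the margin receives a positive slack, so a larger C makes that single violation dominate the objective.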

Missing or Uncertain data
When dealing with missing or uncertain data, we reformulate the problem as a chance-constrained program (CCP):
\[
\min_{w,b,\xi} \ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i
\quad \text{s.t.} \quad \Pr\big(y_i(w^\top X_i - b) \ge 1 - \xi_i\big) \ge 1 - \epsilon, \quad \xi_i \ge 0, \quad i = 1,\dots,m
\]
- The CCP is generally intractable, even if the underlying probability distributions of the $X_i$ are known.
- We want stronger convex conditions that are easier to handle.
- The conditions should also hold for any probability distribution.
- "Robust" means we guard against the worst-case distribution.

Using just Support in Robust Formulation
- Suppose we know the support of each variable, i.e. $x_i \in \{x : D_i x \le d_i\}$.
- If $\epsilon = 0$, then in the robust formulation we pick the worst point(s) of each support set and proceed as in the original formulation.
- This ensures that no training point is misclassified.
- The formulation is [4]:
\[
\min_{w,b} \ \frac{1}{2}\|w\|_2^2
\quad \text{s.t.} \quad \min_{\{x \,:\, D_i x \le d_i\}} y_i(w^\top x - b) \ge 1, \quad i = 1,\dots,n
\]
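To make the inner minimization concrete, here is a minimal sketch for the special case where the support is an axis-aligned box (a simple polyhedron). The box restriction, the function name, and the numbers are assumptions for illustration; a general polyhedron would need a linear program for the inner problem.

```python
def worst_case_margin(w, b, y_i, lower, upper):
    """Minimize y_i * (w.x - b) over the box lower <= x <= upper.

    The objective is linear in x, so the minimum is attained at a corner:
    take the lower bound where y_i * w_j > 0 and the upper bound otherwise.
    """
    worst_point = [lo if y_i * w_j > 0 else hi
                   for w_j, lo, hi in zip(w, lower, upper)]
    margin = y_i * (sum(w_j * x_j for w_j, x_j in zip(w, worst_point)) - b)
    return margin, worst_point


# Hypothetical example: a positive-class point known only up to a unit box.
margin, worst_point = worst_case_margin(
    w=[2.0, -1.0], b=0.5, y_i=1, lower=[1.0, 0.0], upper=[2.0, 1.0]
)
print(margin, worst_point)  # the robust constraint requires margin >= 1
```

In this example the worst corner gives a margin below 1, so this hyperplane would be infeasible for the robust program even if the box center is classified comfortably.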

Transductive SVMs
We are given
- the training data set $D = \{(x_i, c_i) \mid x_i \in \mathbb{R}^p,\ c_i \in \{-1, 1\}\}_{i=1}^{n}$,
- a test data set to classify, $D^* = \{x_j^* \mid x_j^* \in \mathbb{R}^p\}_{j=1}^{m}$.

The optimization model for the transductive SVM can be formulated as
\[
\min_{w,b,c_j^*} \ \frac{1}{2} w^\top w
\quad \text{s.t.} \quad c_i(w^\top x_i - b) \ge 1, \quad i = 1,\dots,n, \qquad
c_j^*(w^\top x_j^* - b) \ge 1, \quad c_j^* \in \{-1, 1\}, \quad j = 1,\dots,m
\]
where the decision variable $c_j^*$ is used to classify the point $x_j^*$ in the test data set.

Using Second Moments to reformulate it to a SOCP
A second-order cone program (SOCP) is a program of the following form:
\[
\min_{x} \ f^\top x
\quad \text{s.t.} \quad \|A_i x + b_i\|_2 \le c_i^\top x + d_i, \quad i = 1,\dots,m, \qquad Fx = g
\]
- If $A_i = 0$ for all $i$, it reduces to a linear program.
- If $c_i = 0$ for all $i$, it reduces to a convex quadratically constrained program.
- An SOCP can be written as a semidefinite program and solved with those methods.
- More recent interior-point methods take advantage of the SOCP structure directly.
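A small sketch of what a single second-order cone constraint checks, written with plain Python lists. The function name and the example data are assumptions; this only tests feasibility of a given point and does not solve an SOCP.

```python
import math


def soc_constraint_holds(A, b, c, d, x, tol=1e-9):
    """Check ||A x + b||_2 <= c.x + d for one cone constraint."""
    residual = [sum(a_kj * x_j for a_kj, x_j in zip(row, x)) + b_k
                for row, b_k in zip(A, b)]
    lhs = math.sqrt(sum(r * r for r in residual))
    rhs = sum(c_j * x_j for c_j, x_j in zip(c, x)) + d
    return lhs <= rhs + tol


# With A = I, b = 0, c = 0, d = 5 the constraint is a disk of radius 5.
identity = [[1.0, 0.0], [0.0, 1.0]]
on_boundary = soc_constraint_holds(identity, [0.0, 0.0], [0.0, 0.0], 5.0, [3.0, 4.0])
outside = soc_constraint_holds(identity, [0.0, 0.0], [0.0, 0.0], 5.0, [3.0, 4.1])
print(on_boundary, outside)
```

Setting c to zero, as here, gives the norm-ball case; a nonzero c lets the allowed radius grow or shrink linearly with x.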

Multivariate Chebyshev Inequality
The multivariate Chebyshev inequality lets us bound the probability of being misclassified using only the mean and covariance of the data:
\[
\sup_{y \sim (\bar{y}, \Sigma)} \Pr(y \in S) = (1 + d^2)^{-1},
\quad \text{where} \quad d^2 = \inf_{y \in S} (y - \bar{y})^\top \Sigma^{-1} (y - \bar{y})
\]
- $S$ is the convex set we care about. For the SVM it is the half-space on one side of the hyperplane.
- The supremum is taken over all distributions with the same mean $\bar{y}$ and covariance $\Sigma$.

Robust Formulation
\[
\inf_{X_i \sim (\bar{x}_i, \Sigma_i)} \Pr\big(y_i(w^\top X_i - b) \ge 1 - \xi_i\big) \ge 1 - \epsilon
\]
- We take the worst-case distribution having our mean and covariance.
- This is the robust formulation.
- In order to use the Chebyshev inequality, we rewrite it as follows:
\[
\sup_{X_i \sim (\bar{x}_i, \Sigma_i)} \Pr\big(y_i(w^\top X_i - b) \le 1 - \xi_i\big) \le \epsilon
\]

Plugging in what we know, we get the following inequality:
\[
\epsilon \ge (1 + d^2)^{-1},
\qquad d^2 = \inf_{\{x \,:\, y_i(w^\top x - b) \le 1 - \xi_i\}} (x - \bar{x}_i)^\top \Sigma_i^{-1} (x - \bar{x}_i)
\]
- If the mean $\bar{x}_i$ lies on the hyperplane (or on the wrong side of it), then $d = 0$ and in the worst case there is a 100 percent chance of misclassifying the point. In that case we move the hyperplane constraint using the penalty $\xi_i$.
- Otherwise $d$ is the distance from the mean to the hyperplane, measured in the metric of $\Sigma_i$:
\[
d^2 = \frac{\big(y_i(w^\top \bar{x}_i - b) - 1 + \xi_i\big)^2}{w^\top \Sigma_i w}
\]
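The bound can be evaluated numerically. The sketch below computes the worst-case probability $(1 + d^2)^{-1}$ from a mean and covariance, assuming the closed form for $d^2$ with a squared numerator and a hyperplane written as $w^\top x - b$. The function name and numbers are illustrative, and $w^\top \Sigma w$ is assumed to be positive.

```python
def worst_case_violation_probability(w, b, y_i, mean, cov, xi=0.0):
    """Chebyshev worst-case probability that the margin constraint fails.

    Returns 1.0 when the mean lies on the boundary or the wrong side (d = 0).
    """
    n = len(w)
    signed_gap = y_i * (sum(w[j] * mean[j] for j in range(n)) - b) - 1.0 + xi
    if signed_gap <= 0.0:
        return 1.0
    w_cov_w = sum(w[j] * cov[j][k] * w[k] for j in range(n) for k in range(n))
    d_squared = signed_gap ** 2 / w_cov_w
    return 1.0 / (1.0 + d_squared)


identity = [[1.0, 0.0], [0.0, 1.0]]
# Mean two covariance-units beyond the margin: d^2 = 4.
p_far = worst_case_violation_probability([1.0, 0.0], 0.0, 1, [3.0, 0.0], identity)
# Mean exactly on the margin: nothing can be guaranteed.
p_on = worst_case_violation_probability([1.0, 0.0], 0.0, 1, [1.0, 0.0], identity)
print(p_far, p_on)
```

Requiring this probability to be at most $\epsilon$ is the condition that the next slide turns into a cone constraint.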

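Theorem of CCP
Now we have the following theorem [5].
Theorem. The constraints of the classification problem with uncertainty (the CCP) are satisfied for all probability distributions having the same mean and covariance by any feasible point of the following second-order cone program:
\[
\min_{w,b,\xi} \ \frac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.} \quad y_i(w^\top \bar{x}_i - b) \ge 1 - \xi_i + \sqrt{\frac{1-\epsilon}{\epsilon}}\, \big\|\Sigma_i^{1/2} w\big\|_2, \quad \xi_i \ge 0, \quad 1 \le i \le n
\]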
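For a fixed hyperplane, the cone constraint in the theorem determines the smallest feasible slack for each point. The sketch below computes it using $\|\Sigma^{1/2} w\|_2^2 = w^\top \Sigma w$, which avoids an explicit matrix square root. The function name and the toy numbers are assumptions for illustration.

```python
import math


def robust_slack(w, b, y_i, mean, cov, eps):
    """Smallest xi_i >= 0 with
    y_i*(w.mean - b) >= 1 - xi_i + sqrt((1-eps)/eps) * ||cov^(1/2) w||_2.
    """
    n = len(w)
    kappa = math.sqrt((1.0 - eps) / eps)
    w_cov_w = sum(w[j] * cov[j][k] * w[k] for j in range(n) for k in range(n))
    score = y_i * (sum(w[j] * mean[j] for j in range(n)) - b)
    return max(0.0, 1.0 + kappa * math.sqrt(w_cov_w) - score)


identity = [[1.0, 0.0], [0.0, 1.0]]
slack_loose = robust_slack([1.0, 0.0], 0.0, 1, [3.0, 0.0], identity, eps=0.2)
slack_tight = robust_slack([1.0, 0.0], 0.0, 1, [3.0, 0.0], identity, eps=0.1)
print(slack_loose, slack_tight)
```

Lowering $\epsilon$ from 0.2 to 0.1 raises the safety factor from about 2 to about 3, so the same point goes from needing essentially no slack to needing a slack of about 1.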

Reformulation to a SOCP
- An SOCP needs a linear objective function.
- Replacing the term $\frac{1}{2}\|w\|_2^2$ in the objective with the constraint $\|w\|_2 \le W$ gives the same answer, provided $C$ and $W$ are tuned appropriately.
- Software that can model or solve SOCPs includes AMPL, CPLEX, ECOS, Gurobi, JOptimizer, MOSEK, OpenOpt, SDPT3, and Xpress.
- Semidefinite programming methods can also be used to solve it.

Incorporating more (or less) information
Some problems with the last formulation:
- It assumed we knew the means and covariances. More likely we only have estimates of them.
- It didn't allow us to include the support of the variables in the model, although we want to include all the information we know.
- Sometimes we only know bounds on the means and covariances.

Incorporating all our information [1]
Theorem. Assume that for the independent random variables $X_{ij}$, $j = 1,\dots,n$, we know the support ($l_{ij} \le X_{ij} \le u_{ij}$), bounds on the first moments ($\mu_{ij}^- \le \mu_{ij} \le \mu_{ij}^+$), and bounds on the second moments ($0 \le E[X_{ij}^2] \le \sigma_{ij}^2$). Then our CCP constraint is satisfied if the following convex constraint is satisfied:
\[
1 - \xi_i + y_i b + \sum_{j} \max\big[-y_i \mu_{ij}^- w_j,\ -y_i \mu_{ij}^+ w_j\big] + \kappa\, \big\|\Sigma_{(1),i}\, w\big\|_2 \le 0
\]
Note that $\kappa = \sqrt{2 \log(1/\epsilon)}$ and
\[
\Sigma_{(1),i} = \operatorname{diag}\big(\big[s_{i1}\,\nu(\mu_{i1}^-, \mu_{i1}^+, \sigma_{i1}),\ \dots,\ s_{in}\,\nu(\mu_{in}^-, \mu_{in}^+, \sigma_{in})\big]\big),
\]
where $\nu(\mu_{ij}^-, \mu_{ij}^+, \sigma_{ij})$ will be defined later.
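To get a feel for the safety factor $\kappa = \sqrt{2\log(1/\epsilon)}$, the sketch below tabulates it next to the factor $\sqrt{(1-\epsilon)/\epsilon}$ from the mean-and-covariance theorem. The function names are made up for this example, and the comparison is only indicative: the two factors multiply different norms ($\|\Sigma_{(1),i} w\|_2$ versus $\|\Sigma_i^{1/2} w\|_2$), so they are not directly interchangeable.

```python
import math


def chebyshev_kappa(eps):
    """Factor sqrt((1 - eps) / eps) from the mean/covariance SOCP."""
    return math.sqrt((1.0 - eps) / eps)


def moment_bound_kappa(eps):
    """Factor sqrt(2 * log(1 / eps)) from the support/moment-bound constraint."""
    return math.sqrt(2.0 * math.log(1.0 / eps))


table = [(eps, chebyshev_kappa(eps), moment_bound_kappa(eps))
         for eps in (0.2, 0.05, 0.01)]
for eps, k_cheb, k_moment in table:
    print(f"eps={eps:<5} chebyshev={k_cheb:.3f} moment_bound={k_moment:.3f}")
```

For the small values of $\epsilon$ tried here the logarithmic factor grows much more slowly than the Chebyshev factor, which suggests why using the extra support and independence information can give less conservative constraints.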

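Key Ideas from the Proof
- Consider $a_{i0} = 1 - \xi_i + y_i b$ and $a_i = -y_i w$. Then we can rewrite our CCP constraint as
\[
\Pr(a_i^\top X_i + a_{i0} \ge 0) \le \epsilon \qquad (3)
\]
- Now use that
\[
\Pr(a_i^\top X_i + a_{i0} \ge 0) = \Pr\big(e^{\alpha a_i^\top X_i}\, e^{\alpha a_{i0}} \ge 1\big), \quad \alpha > 0
\]
- The Markov inequality states that $\Pr(X \ge a) \le E(X)/a$ for non-negative random variables $X$ and $a > 0$.
- The $X_{ij}$ for $j = 1,\dots,n$ are independent, so the expectation of the product factorizes.
- We get the following inequality:
\[
\Pr(a_i^\top X_i + a_{i0} \ge 0) \le e^{\alpha a_{i0}} \prod_{j} E\big[e^{\alpha a_{ij} X_{ij}}\big] \qquad (4)
\]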

Key Ideas from the Proof
- We have now turned our random variables into non-negative random variables.
- We use several bounds (from other papers), the AM-GM inequality, and a Taylor series approximation to get the right convex conditions.
- There is no particular intuition here, just slogging through the calculations.
- We can find similar convex conditions for different kinds of information:
  - support information and bounds for the first and second moments (the last theorem),
  - support information and exact values for the first and second moments,
  - the same two cases as above, but assuming the second moments are unknown.

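Sparse SVMs
The basic sparse linear SVM is exactly the same as before, but we now use the $\ell_1$ norm in $\mathbb{R}^n$ [2, 3]:
\[
\min_{w,b,\xi} \ \|w\|_1 + C\sum_{i=1}^{m}\xi_i
\quad \text{s.t.} \quad y_i(w^\top x_i - b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,m
\]
- The sparsest choice is $\ell_0$, which counts the number of non-zero entries.
- $\ell_0$ is not continuous (and not a true norm), so $\ell_1$ is the next best choice.
- $(1, 0)$ and $(\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}})$ both have norm 1 in $\ell_2$, but $(\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}})$ has norm $\sqrt{2}$ in $\ell_1$, so $\ell_1$ penalizes the dense vector more.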
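The two-vector example can be checked directly. This is a minimal sketch with hypothetical helper names that compares the $\ell_0$ count, the $\ell_1$ norm, and the $\ell_2$ norm of a sparse and a dense vector of equal Euclidean length.

```python
import math


def l0_count(v):
    """Number of non-zero entries (not a true norm)."""
    return sum(1 for v_j in v if v_j != 0)


def l1_norm(v):
    return sum(abs(v_j) for v_j in v)


def l2_norm(v):
    return math.sqrt(sum(v_j * v_j for v_j in v))


sparse = [1.0, 0.0]
dense = [1.0 / math.sqrt(2.0), 1.0 / math.sqrt(2.0)]
for name, v in (("sparse", sparse), ("dense", dense)):
    print(name, l0_count(v), round(l1_norm(v), 4), round(l2_norm(v), 4))
```

Both vectors cost the same under $\ell_2$, while $\ell_1$ charges the dense vector about 41 percent more, which is the mechanism that pushes the $\ell_1$-regularized solution toward zero weights.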

Full LP sparse ν-SVM
\[
\min_{\rho,w,b,\xi} \ \|w\|_1 - \nu\rho + \sum_{i=1}^{m}\xi_i
\quad \text{s.t.} \quad y_i(w^\top x_i - b) \ge \rho - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,m, \qquad \rho \ge 0
\]
ν has three properties which make it better than using $C$:
1. It is an upper bound on the fraction of margin errors (points $x_i$ with $\xi_i > 0$), i.e. on $ME/m$.
2. It is a lower bound on the fraction of support vectors (points on the boundary), i.e. on $SV/m$.
3. If the data is drawn i.i.d. from a distribution, then asymptotically, with probability one, ν equals both the fraction of margin errors and the fraction of support vectors.

Key ideas behind ν-SVM
- This gives the same answer as the C-SVM with $C = 1/\rho$.
- ν is a more intuitive parameter, and its interpretation stays the same even if you change dimensions or add data points.
- There is one extra decision variable, but since the choice of $C$ is so heuristic, the overall tuning effort is about the same.

Benefits of Sparse SVM
- Big data problems can have thousands of dimensions, of which only a few have real predictive power.
- A sparse SVM lets the algorithm pick the dimensions that matter.
- If you have thousands of dimensions but just a few data points (as in genetics), a sparse SVM is essential to avoid over-fitting.
- Using $\ell_1$ also means the problem can be reformulated as a linear program (LP).

How to Enforce Sparseness
Though $\ell_1$ gives sparser solutions than $\ell_2$, we would like them to be even sparser. We can do this by reducing the dimensions in the following ways:
- Decide on an arbitrary cutoff that removes features (dimensions) with small weights or with too large a standard deviation (pre-processing).
- Introduce artificial features (dimensions) that carry no information about the categories (for example, drawn from a normal distribution with mean zero) and use the average of their weights as the cutoff.
- Use several random subsets of the features (dimensions) and then bag the models together (bootstrap aggregation). This leads to less variance and less over-fitting for unstable models.
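The artificial-feature idea can be sketched as follows, assuming a model has already been trained on the real features plus some noise features and that the cutoff is the mean absolute weight of the noise features. Using absolute values, the function name, and the weights shown are all assumptions for this example.

```python
def select_features(real_weights, probe_weights):
    """Keep real features whose |weight| exceeds the mean |weight| of the
    uninformative probe features (hypothetical selection rule)."""
    cutoff = sum(abs(w) for w in probe_weights) / len(probe_weights)
    kept = [j for j, w in enumerate(real_weights) if abs(w) > cutoff]
    return kept, cutoff


# Hypothetical weights from an already-trained linear model.
real_weights = [0.90, -0.02, 0.00, -0.45, 0.03]
probe_weights = [0.04, -0.06, 0.05]  # weights the model gave to pure-noise features
kept, cutoff = select_features(real_weights, probe_weights)
print(kept, cutoff)
```

The appeal of this rule is that the cutoff is calibrated by the model itself: any real feature weighted no more heavily than noise is treated as noise.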

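Adding uncertainty into a Sparse Model
The convex constraints of the uncertain SVM models above do not depend on the norm used in the objective, so we can simply swap in the $\ell_1$ norm:
\[
\min_{w,b,\xi} \ \|w\|_1 + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.} \quad y_i(w^\top \bar{x}_i - b) \ge 1 - \xi_i + \sqrt{\frac{1-\epsilon}{\epsilon}}\, \big\|\Sigma_i^{1/2} w\big\|_2, \quad \xi_i \ge 0, \quad 1 \le i \le n
\]
Would it be possible to add ν to this model to get rid of $C$?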

Possible new model
Putting together the ideas from before, we could write the full sparse robust ν-SVM as follows:
\[
\min_{\rho,\xi,w,b} \ \|w\|_1 - \nu\rho + \sum_{i=1}^{m}\xi_i
\quad \text{s.t.} \quad y_i(w^\top \bar{x}_i - b) \ge \rho - \xi_i + \sqrt{\frac{1-\epsilon}{\epsilon}}\, \big\|\Sigma_i^{1/2} w\big\|_2, \quad \xi_i \ge 0, \quad i = 1,\dots,m, \qquad \rho \ge 0
\]
- It is not clear what ν represents once uncertainty is included.
- Will this even give us what we want?
- If it doesn't, how could we change it to recover properties similar to those before?

Other possible regularization functions
We have talked about $\ell_2$ and $\ell_1$ as regularizing terms. What about other regularizations?
- If we look at $\ell_n$ as $n$ increases, we get less sparse solutions.
- If we look at $\ell_n$ for $0 < n < 1$, these are no longer normed spaces (that is, $\|\cdot\|_n$ is not a norm). However, they do increase sparsity.
- The idea behind LASSO (least absolute shrinkage and selection operator) is just to use the $\ell_1$ norm for linear regression. Nothing new is added.
- SCAD (Smoothly Clipped Absolute Deviation) regularization uses the following function:
\[
p_\lambda(w_j) =
\begin{cases}
\lambda |w_j| & |w_j| \le \lambda \\[4pt]
\dfrac{|w_j|^2 - 2a\lambda |w_j| + \lambda^2}{2(1 - a)} & \lambda < |w_j| \le a\lambda \\[8pt]
\dfrac{(a+1)\lambda^2}{2} & |w_j| > a\lambda
\end{cases}
\]
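The piecewise SCAD function is easy to evaluate directly. The sketch below implements the three branches as written; the default $a = 3.7$ is the value commonly suggested in the SCAD literature and is an assumption here, since the slide does not fix $a$ (the formula needs $a > 1$ for the pieces to join up).

```python
def scad_penalty(w_j, lam, a=3.7):
    """SCAD penalty p_lambda(w_j): linear near zero, quadratic blend, then flat."""
    t = abs(w_j)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (t * t - 2.0 * a * lam * t + lam * lam) / (2.0 * (1.0 - a))
    return (a + 1.0) * lam * lam / 2.0


for w_j in (0.5, 1.0, 2.0, 3.7, 10.0):
    print(w_j, round(scad_penalty(w_j, lam=1.0), 4))
```

Small weights are penalized like $\ell_1$, while the penalty stops growing for large weights, so strong features are not shrunk the way they would be under a pure $\ell_1$ term.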

Future Work
- Add ν somehow into the robust sparse SVM with uncertainty.
- Analyze the changes with different regularization.
- Add multiple classes to an SVM.
- Take these ideas to SVR (support vector regression).

References I
[1] Aharon Ben-Tal, Sahely Bhadra, Chiranjib Bhattacharyya, and J. Saketha Nath. Chance constrained uncertain classification via robust optimization. Mathematical Programming, 127(1):145-173, 2011.
[2] Chiranjib Bhattacharyya, L. R. Grate, Michael I. Jordan, L. El Ghaoui, and I. Saira Mian. Robust sparse hyperplane classifiers: application to uncertain molecular profiling data. Journal of Computational Biology, 11(6):1073-1089, 2004.

References II
[3] Jinbo Bi, Kristin Bennett, Mark Embrechts, Curt Breneman, and Minghu Song. Dimensionality reduction via sparse support vector machines. The Journal of Machine Learning Research, 3:1229-1243, 2003.
[4] Neng Fan, Elham Sadeghi, and Panos M. Pardalos. Robust support vector machines with polyhedral uncertainty of the input data. pages 291-305, 2014.
[5] Pannagadatta K. Shivaswamy, Chiranjib Bhattacharyya, and Alexander J. Smola. Second order cone programming approaches for handling missing and uncertain data. The Journal of Machine Learning Research, 7:1283-1314, 2006.