Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs


Second Order Cone Programming, Missing or Uncertain Data, and Sparse SVMs
Ammon Washburn
University of Arizona
September 25, 2015
1 / 28

Introduction
- We will begin with basic Support Vector Machines (SVMs), or maximum-margin algorithms
- We will introduce missing or uncertain data into the training data
- We will reformulate the resulting chance-constrained programs (CCPs) into SOCPs that we can solve using different kinds of information
- Introduce sparse SVMs and why they are used
- Slight digression on ν-SVMs
- Talk about future research areas
2 / 28

SVMs and MM programs
The basic (linear) maximum-margin (MM) program is defined by the following optimization problem:

$$\begin{aligned}
\min_{w,b,\xi}\quad & \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i \\
\text{s.t.}\quad & y_i(w^\top x_i - b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,m
\end{aligned}$$

This program finds a hyperplane between the groups of data points and uses it to categorize new data. ξ_i is the slack penalty paid when a data point has to be moved to the correct side of the margin, while C is a heuristic trade-off constant. ‖w‖ and the margin between the groups are inversely related.
3 / 28
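The following is a minimal sketch, not part of the talk, of how this soft-margin SVM could be solved with cvxpy; the toy data, the value of C, and the variable names are placeholder assumptions.

```python
# Sketch only: the soft-margin SVM above, solved with cvxpy.
# X, y, and C are placeholder toy values, not data from the talk.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 0.8, (20, 2)),   # class -1
               rng.normal(+2.0, 0.8, (20, 2))])  # class +1
y = np.hstack([-np.ones(20), np.ones(20)])
m, p = X.shape
C = 1.0                                          # heuristic trade-off constant

w = cp.Variable(p)
b = cp.Variable()
xi = cp.Variable(m, nonneg=True)                 # slack variables

objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w - b) >= 1 - xi]
cp.Problem(objective, constraints).solve()

# the margin width 2/||w|| is inversely related to ||w||, as noted above
print("margin width:", 2 / np.linalg.norm(w.value))
```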

Missing or Uncertain data
When dealing with missing or uncertain data we reformulate the problem as a chance-constrained program (CCP):

$$\begin{aligned}
\min_{w,b,\xi}\quad & \tfrac{1}{2}\|w\|_2^2 + C\sum_{i=1}^{m}\xi_i \\
\text{s.t.}\quad & \Pr\!\left(y_i(w^\top X_i - b) \ge 1 - \xi_i\right) \ge 1 - \epsilon,\quad \xi_i \ge 0,\quad i = 1,\dots,m
\end{aligned}$$

- Generally intractable even if the underlying probability distributions of the X_i are known
- We want stronger but easier-to-handle convex conditions
- that also hold for any probability distribution (with the given properties)
- "Robust" means guarding against the worst-case distribution
4 / 28

Using just Support in the Robust Formulation
Suppose we know the support of each variable, i.e. x_i ∈ {x : D_i x ≤ d_i}.
- If ε = 0, then in the robust formulation we pick the worst point(s) and proceed as in the original formulation
- This ensures no misclassification of the training data

The formulation is [4]:

$$\begin{aligned}
\min_{w,b}\quad & \tfrac{1}{2}\|w\|^2 \\
\text{s.t.}\quad & \min_{\{x : D_i x \le d_i\}} y_i(w^\top x - b) \ge 1,\quad i = 1,\dots,n
\end{aligned}$$
5 / 28
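As an illustration only (the talk covers general polyhedral support D_i x ≤ d_i), here is a sketch of the special case of box support x_i ∈ [x_i - δ_i, x_i + δ_i], for which the inner minimum has the closed form y_i(w^⊤ x_i - b) - δ_i^⊤|w|; the data and the half-widths δ are made up.

```python
# Sketch only: support-based robust SVM for box support (one instance of
# D_i x <= d_i).  Worst case over the box: y_i (w'x_i - b) - delta_i'|w| >= 1.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3.0, 0.5, (15, 2)),
               rng.normal(+3.0, 0.5, (15, 2))])
y = np.hstack([-np.ones(15), np.ones(15)])
delta = 0.3 * np.ones_like(X)                    # per-feature box half-widths

w = cp.Variable(2)
b = cp.Variable()
worst_case_margin = cp.multiply(y, X @ w - b) - delta @ cp.abs(w)
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), [worst_case_margin >= 1])
prob.solve()
print(prob.status, w.value, b.value)
```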


Transductive SVMs
Given a training data set D = {(x_i, c_i) | x_i ∈ R^p, c_i ∈ {-1, 1}}, i = 1, ..., n, and a test data set to classify, {x_j | x_j ∈ R^p}, j = 1, ..., m.
7 / 28

Transductive SVMs
The optimization model for the transductive SVM can be formulated as

$$\begin{aligned}
\min_{w,b,c_j}\quad & \tfrac{1}{2} w^\top w \\
\text{s.t.}\quad & c_i(w^\top x_i - b) \ge 1,\quad i = 1,\dots,n \\
& c_j(w^\top x_j - b) \ge 1,\quad c_j \in \{-1, 1\},\quad j = 1,\dots,m
\end{aligned}$$

where the decision variable c_j is used to classify the test point x_j.
7 / 28

Using Second Moments to Reformulate the CCP as a SOCP
A second-order cone program (SOCP) is a program of the form

$$\begin{aligned}
\min_{x}\quad & f^\top x \\
\text{s.t.}\quad & \|A_i x + b_i\|_2 \le c_i^\top x + d_i,\quad i = 1,\dots,m \\
& Fx = g
\end{aligned}$$

- If A_i = 0 for all i, it reduces to a linear program
- If c_i = 0 for all i, it reduces to a quadratically constrained program
- A SOCP can be formulated as a semidefinite program and solved with those methods
- More recently, interior-point methods have appeared that exploit the SOCP structure directly
8 / 28
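A minimal cvxpy sketch of this generic SOCP form; the data f, A, b, c, d, F, g are made-up placeholders chosen only so the problem is feasible and bounded.

```python
# Sketch only: the generic SOCP form above, with placeholder problem data.
import cvxpy as cp
import numpy as np

n = 3
f = np.array([1.0, -2.0, 0.5])
A = np.eye(n)                      # one cone constraint: ||A x + b||_2 <= c'x + d
b = np.zeros(n)
c = 0.1 * np.ones(n)
d = 1.0
F = np.ones((1, n))                # one linear equality: sum(x) == 0.5
g = np.array([0.5])

x = cp.Variable(n)
constraints = [cp.SOC(c @ x + d, A @ x + b),   # second-order cone constraint
               F @ x == g]
prob = cp.Problem(cp.Minimize(f @ x), constraints)
prob.solve()                       # a dedicated SOCP solver (e.g. ECOS, MOSEK, listed later)
print(prob.status, x.value)        # can be requested via solver=...
```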

Multivariate Chebyshev Inequality
The multivariate Chebyshev inequality lets us bound the probability of misclassification using only the mean and covariance of the data:

$$\sup_{y \sim (\bar{y}, \Sigma)} \Pr(y \in S) = (1 + d^2)^{-1}, \qquad d^2 = \inf_{y \in S} (y - \bar{y})^\top \Sigma^{-1} (y - \bar{y})$$

- S is the convex set we care about; in the SVM setting it is one side of the hyperplane
- The bound holds over all distributions having the same mean and covariance
9 / 28

Robust Formulation

$$\inf_{X_i \sim (\bar{x}_i, \Sigma_i)} \Pr\!\left(y_i(w^\top X_i - b) \ge 1 - \xi_i\right) \ge 1 - \epsilon$$

- We take the worst-case distribution having our mean and covariance; this is the robust formulation
- In order to use the Chebyshev inequality, we reformulate it as

$$\sup_{X_i \sim (\bar{x}_i, \Sigma_i)} \Pr\!\left(y_i(w^\top X_i - b) \le 1 - \xi_i\right) \le \epsilon$$
10 / 28

Plugging in what we know, we get the inequality

$$\epsilon \ge (1 + d^2)^{-1}, \qquad d^2 = \inf_{\{x \,:\, y_i(w^\top x - b) \le 1 - \xi_i\}} (x - \bar{x}_i)^\top \Sigma^{-1} (x - \bar{x}_i)$$

- If the mean x̄_i lies on the hyperplane (or on the wrong side of it), then in the worst case there is a 100 percent chance of misclassifying the data; we just move the hyperplane, paying the penalty ξ_i
- Otherwise d is the (Mahalanobis) distance from the mean to the hyperplane:

$$d^2 = \frac{\left(y_i(w^\top \bar{x}_i - b) - 1 + \xi_i\right)^2}{w^\top \Sigma\, w}$$
11 / 28

Theorem for the CCP
Now we have the following theorem [5].

Theorem
The chance constraint of the CCP is satisfied for all probability distributions having the given means and covariances by any solution of the following second-order cone program:

$$\begin{aligned}
\min_{w,b,\xi}\quad & \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i \\
\text{s.t.}\quad & y_i(w^\top \bar{x}_i - b) \ge 1 - \xi_i + \sqrt{\tfrac{1-\epsilon}{\epsilon}}\,\big\|\Sigma_i^{1/2} w\big\|_2 \\
& \xi_i \ge 0,\quad 1 \le i \le n
\end{aligned}$$
12 / 28
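A minimal sketch of this SOCP in cvxpy, assuming placeholder means x̄_i, covariance square roots Σ_i^{1/2}, and values of C and ε (none of these come from the talk).

```python
# Sketch only: the robust (chance-constrained) SVM SOCP above, via cvxpy.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
n_pts, p = 40, 2
xbar = np.vstack([rng.normal(-1, 1, (20, p)), rng.normal(+1, 1, (20, p))])
y = np.hstack([-np.ones(20), np.ones(20)])
Sigma_sqrt = [0.2 * np.eye(p) for _ in range(n_pts)]   # Sigma_i^{1/2}, here isotropic
C, eps = 1.0, 0.1
kappa = np.sqrt((1 - eps) / eps)                       # multiplier from the theorem

w = cp.Variable(p)
b = cp.Variable()
xi = cp.Variable(n_pts, nonneg=True)

constraints = [
    y[i] * (xbar[i] @ w - b) >= 1 - xi[i] + kappa * cp.norm(Sigma_sqrt[i] @ w, 2)
    for i in range(n_pts)
]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)), constraints)
prob.solve()
print(prob.status, w.value, b.value)
```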

Reformulation to a SOCP
- SOCPs need a linear objective function
- Replacing (1/2)‖w‖² with the constraint ‖w‖ ≤ W gives the same answer if C and W are tuned appropriately
- Packages with methods for solving SOCPs include AMPL, CPLEX, ECOS, Gurobi, JOptimizer, MOSEK, OpenOpt, SDPT3, and Xpress
- One can also use semidefinite-programming methods to solve it
13 / 28

Incorporating more (or less) information
Some problems with the last formulation:
- It assumed we knew the means and covariances (more likely we only have estimates of them)
- It didn't allow us to include the support of the variables in the model (we want to include all the information we know)
- Sometimes we only know ranges (support) for the means and covariances
14 / 28

Incorporating all our information [1]

Theorem
Assume the supports ($l_{ij} \le X_{ij} \le u_{ij}$), bounds on the first moments ($\mu^-_{ij} \le \mu_{ij} \le \mu^+_{ij}$), and bounds on the second moments ($0 \le E[X_{ij}^2] \le \sigma_{ij}^2$) of the independent random variables $X_{ij}$, $j = 1,\dots,n$, are known. Then our CCP constraint is satisfied if the following convex constraint is satisfied:

$$1 - \xi_i + y_i b + \sum_j \max\!\left[-y_i \mu^-_{ij} w_j,\; -y_i \mu^+_{ij} w_j\right] + \kappa \left\|\Sigma_{(1),i}\, w\right\| \le 0$$

Note $\kappa = \sqrt{2\log(1/\epsilon)}$ and

$$\Sigma_{(1),i} = \mathrm{diag}\!\left(\left[s_{i1}\,\nu(\mu^-_{i1}, \mu^+_{i1}, \sigma_{i1}), \dots, s_{in}\,\nu(\mu^-_{in}, \mu^+_{in}, \sigma_{in})\right]\right),$$

where $\nu(\mu^-_{ij}, \mu^+_{ij}, \sigma_{ij})$ will be defined later.
15 / 28

Key Ideas from the Proof
Consider $a_{i0} = 1 - \xi_i + y_i b$ and $a_i = -y_i w$. Then we can rewrite our CCP constraint as

$$\Pr(a_i^\top X_i + a_{i0} \ge 0) \le \epsilon. \tag{3}$$

Now use:

$$\Pr(a_i^\top X_i + a_{i0} \ge 0) = \Pr\!\left(e^{\alpha a_i^\top X_i} e^{\alpha a_{i0}} \ge 1\right), \quad \alpha \ge 0,$$

- the Markov inequality $\Pr(X \ge a) \le E(X)/a$ for non-negative random variables, and
- that the $X_{ij}$, $j = 1,\dots,n$, are independent.

We get the inequality

$$\Pr(a_i^\top X_i + a_{i0} \ge 0) \le e^{\alpha a_{i0}} \prod_j E\!\left[e^{\alpha a_{ij} X_{ij}}\right]. \tag{4}$$
16 / 28

Key Ideas from the Proof
- We have now turned our random variables into non-negative random variables
- We use several bounds (from other papers), the AM-GM inequality, and a Taylor series approximation to get the right convex conditions
- No intuition, just slugging away at the calculations
- We can find similar convex conditions for different kinds of information:
  - support information and bounds on the first and second moments (the last theorem)
  - support information and exact values for the first and second moments
  - the same two cases, but assuming the second moments are unknown
17 / 28

Sparse SVMs
The basic sparse linear SVM is exactly the same as before, but now uses the l_1 norm on R^n [2, 3]:

$$\begin{aligned}
\min_{w,b,\xi}\quad & \|w\|_1 + C\sum_{i=1}^{m}\xi_i \\
\text{s.t.}\quad & y_i(w^\top x_i - b) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1,\dots,m
\end{aligned}$$

- The sparsest "norm" is l_0, which counts the number of non-zero entries
- That "norm" is not continuous, so l_1 is the next best choice
- Example: (1, 0) and (1/√2, 1/√2) both have norm 1 in l_2, but (1/√2, 1/√2) has norm √2 in l_1
18 / 28
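A minimal sketch of this l_1 (sparse) SVM in cvxpy with synthetic placeholder data in which only a few features are informative; because the objective and constraints are piecewise linear, the solver effectively sees an LP.

```python
# Sketch only: l1-regularized (sparse) SVM; X, y, and C are placeholders.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)
m, p = 40, 50                      # many more features than are informative
X = rng.normal(size=(m, p))
w_true = np.zeros(p)
w_true[:3] = [2.0, -1.5, 1.0]      # only 3 relevant features
y = np.sign(X @ w_true + 0.1 * rng.normal(size=m))
C = 1.0

w = cp.Variable(p)
b = cp.Variable()
xi = cp.Variable(m, nonneg=True)
prob = cp.Problem(cp.Minimize(cp.norm1(w) + C * cp.sum(xi)),
                  [cp.multiply(y, X @ w - b) >= 1 - xi])
prob.solve()
print("nonzero weights:", np.sum(np.abs(w.value) > 1e-6))
```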

Full LP sparse ν-SVM

$$\begin{aligned}
\min_{\rho,w,b,\xi}\quad & \|w\|_1 - \nu\rho + \frac{1}{m}\sum_{i=1}^{m}\xi_i \\
\text{s.t.}\quad & y_i(w^\top x_i - b) \ge \rho - \xi_i,\quad i = 1,\dots,m \\
& \xi_i \ge 0,\quad i = 1,\dots,m,\quad \rho \ge 0
\end{aligned}$$

ν has three properties that make it nicer to use than C:
1. It is an upper bound on the fraction of margin errors (points x_i with ξ_i > 0), ME/m
2. It is a lower bound on the fraction of support vectors (points on the boundary), SV/m
3. If the data are drawn i.i.d. from a distribution, then asymptotically, with probability one, ν equals the fraction of margin errors and of support vectors
19 / 28
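A minimal sketch of this LP ν-SVM in cvxpy, written under the assumption that the slack term carries the usual 1/m weight; the data and ν = 0.2 are placeholders, and the printed fraction of margin errors can be checked against property 1.

```python
# Sketch only: LP sparse nu-SVM with placeholder data.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(5)
m, p = 60, 10
X = rng.normal(size=(m, p))
y = np.sign(X[:, 0] - 0.5 * X[:, 1] + 0.2 * rng.normal(size=m))
nu = 0.2

w = cp.Variable(p)
b = cp.Variable()
xi = cp.Variable(m, nonneg=True)
rho = cp.Variable(nonneg=True)

obj = cp.Minimize(cp.norm1(w) - nu * rho + cp.sum(xi) / m)
cons = [cp.multiply(y, X @ w - b) >= rho - xi]
cp.Problem(obj, cons).solve()

margin_errors = np.mean(xi.value > 1e-6)      # fraction of points with xi_i > 0
print("rho =", rho.value, "| margin-error fraction =", margin_errors, "<= nu =", nu)
```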

Key ideas behind ν-SVM
- This gives the same answer as the C-SVM with C = 1/ρ
- ν is a more intuitive parameter and keeps its meaning even if you change the dimension or add data points
- There is an extra decision variable, but since C is so heuristic anyway, the effort is about the same
20 / 28

Benefits of Sparse SVMs
- In big-data problems there are thousands of dimensions, but often only a few carry real predictive power; let the algorithm pick the dimensions that matter
- If you have thousands of dimensions but only a few data points, a sparse SVM is essential to avoid over-fitting (e.g., genetics)
- Using l_1 also means the problem can be reformulated as a linear program (LP)
21 / 28

How to Enforce Sparseness
Though l_1 is sparser than l_2, we would like the solution to be even sparser. We can reduce the dimensions in the following ways:
- Decide an arbitrary cut-off that removes features (dimensions) with small weights or too large a standard deviation (pre-processing)
- Introduce arbitrary features (dimensions) that have no bearing on the categories (drawn from a normal with mean zero) and use the average of their weights as the cutoff (see the sketch below)
- Use several random subsets of the features (dimensions) and then bag the models together (bootstrap aggregation); this leads to less variance and less over-fitting (for unstable models)
22 / 28
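The probe-feature idea in the second bullet could look roughly like the sketch below; the use of scikit-learn's l_1-penalized LinearSVC, the number of probes, and the threshold rule are illustrative assumptions, not the talk's prescription.

```python
# Sketch only: append mean-zero noise "probe" features, fit a sparse linear
# SVM, and keep only real features whose weights beat the average probe weight.
import numpy as np
from sklearn.svm import LinearSVC

def probe_feature_selection(X, y, n_probes=20, seed=0):
    rng = np.random.default_rng(seed)
    probes = rng.normal(size=(X.shape[0], n_probes))      # features with no predictive power
    X_aug = np.hstack([X, probes])
    clf = LinearSVC(penalty="l1", dual=False, C=1.0, max_iter=10000).fit(X_aug, y)
    w = np.abs(clf.coef_.ravel())
    cutoff = w[X.shape[1]:].mean()                         # average probe weight
    return np.where(w[:X.shape[1]] > cutoff)[0]            # indices of kept features
```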

Adding uncertainty into a Sparse Model
The convex uncertain SVM models from before didn't depend on the norm used, so we can simply swap in the l_1 norm:

$$\begin{aligned}
\min_{w,b,\xi}\quad & \|w\|_1 + C\sum_{i=1}^{n}\xi_i \\
\text{s.t.}\quad & y_i(w^\top \bar{x}_i - b) \ge 1 - \xi_i + \sqrt{\tfrac{1-\epsilon}{\epsilon}}\,\big\|\Sigma_i^{1/2} w\big\|_2 \\
& \xi_i \ge 0,\quad 1 \le i \le n
\end{aligned}$$

Would it be possible to add ν to this formulation to get rid of C?
23 / 28

Possible new model
Putting together the previous ideas, we could form a full sparse robust ν-SVM as follows:

$$\begin{aligned}
\min_{\rho,\xi,w,b}\quad & \|w\|_1 - \nu\rho + \frac{1}{m}\sum_{i=1}^{m}\xi_i \\
\text{s.t.}\quad & y_i(w^\top \bar{x}_i - b) \ge \rho - \xi_i + \sqrt{\tfrac{1-\epsilon}{\epsilon}}\,\big\|\Sigma_i^{1/2} w\big\|_2,\quad i = 1,\dots,m \\
& \xi_i \ge 0,\quad i = 1,\dots,m,\quad \rho \ge 0
\end{aligned}$$

- It is not clear what ν represents once uncertainty is added
- Will this even give us what we want?
- If it doesn't, how could we change it to recover the same ideas as before?
24 / 28

Other possible regularization functions
We have used l_2 and l_1 as regularizing terms. What about other regularizations?
- Looking at l_n as n increases, we get less sparse solutions
- For 0 < n < 1, l_n is no longer a norm, but it does increase sparsity
- The idea behind LASSO (least absolute shrinkage and selection operator) is simply to use the l_1 norm for linear regression; nothing new is added
- SCAD, or smoothly clipped absolute deviation, regularization is the following function:

$$p_\lambda(w_j) = \begin{cases} \lambda |w_j| & |w_j| \le \lambda \\[4pt] \dfrac{|w_j|^2 - 2a\lambda |w_j| + \lambda^2}{2(1-a)} & \lambda < |w_j| \le a\lambda \\[4pt] \dfrac{(a+1)\lambda^2}{2} & |w_j| > a\lambda \end{cases}$$
25 / 28
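For concreteness, a small numpy sketch of the SCAD penalty p_λ as written above; the default a = 3.7 is the value commonly suggested in the SCAD literature (Fan and Li), and lam is a placeholder tuning parameter.

```python
# Sketch only: the SCAD penalty defined above, evaluated elementwise.
import numpy as np

def scad_penalty(w, lam, a=3.7):
    w = np.abs(np.asarray(w, dtype=float))
    small = w <= lam                                   # linear (LASSO-like) region
    mid = (w > lam) & (w <= a * lam)                   # quadratic transition region
    out = np.empty_like(w)
    out[small] = lam * w[small]
    out[mid] = (w[mid] ** 2 - 2 * a * lam * w[mid] + lam ** 2) / (2 * (1 - a))
    out[~small & ~mid] = (a + 1) * lam ** 2 / 2        # constant region: no extra shrinkage
    return out

# example: compare penalties at a few weights
print(scad_penalty([0.1, 0.5, 2.0, 10.0], lam=0.5))
```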

Future Work
- Somehow add ν to the robust sparse SVM with uncertainty
- Analyze how the solution changes under different regularizations
- Extend the SVM to multiple classes
- Carry these ideas over to SVR (support vector regression)
26 / 28

References I
Aharon Ben-Tal, Sahely Bhadra, Chiranjib Bhattacharyya, and J. Saketha Nath. Chance constrained uncertain classification via robust optimization. Mathematical Programming, 127(1):145-173, 2011.
Chiranjib Bhattacharyya, L. R. Grate, Michael I. Jordan, L. El Ghaoui, and I. Saira Mian. Robust sparse hyperplane classifiers: application to uncertain molecular profiling data. Journal of Computational Biology, 11(6):1073-1089, 2004.
27 / 28

References II
Jinbo Bi, Kristin Bennett, Mark Embrechts, Curt Breneman, and Minghu Song. Dimensionality reduction via sparse support vector machines. The Journal of Machine Learning Research, 3:1229-1243, 2003.
Neng Fan, Elham Sadeghi, and Panos M. Pardalos. Robust support vector machines with polyhedral uncertainty of the input data. Pages 291-305, 2014.
Pannagadatta K. Shivaswamy, Chiranjib Bhattacharyya, and Alexander J. Smola. Second order cone programming approaches for handling missing and uncertain data. The Journal of Machine Learning Research, 7:1283-1314, 2006.
28 / 28