Indirect Rule Learning: Support Vector Machines
Indirect learning: loss optimization
Indirect learning does not estimate the prediction rule $f(x)$ directly, since most loss functions do not have explicit optimizers. Instead, it aims to minimize an empirical approximation of the expected loss. Most often, it minimizes the empirical risk $\sum_{i=1}^n L(Y_i, f(X_i))$ (empirical risk minimization). For example, least squares estimation uses $\sum_{i=1}^n (Y_i - f(X_i))^2$; a classification problem uses $\sum_{i=1}^n I(Y_i \neq f(X_i))$.
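As a quick illustration of empirical risk minimization, the numpy sketch below (synthetic data; the linear rule f is only an illustrative candidate) evaluates the empirical squared-error risk and the empirical 0-1 risk for a fixed candidate rule.

```python
import numpy as np

# Toy data: n observations and a candidate prediction rule f (all illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
Y = np.sign(X[:, 0] + 0.5 * rng.normal(size=100))   # labels in {-1, 1}

def f(x):
    """A candidate linear rule; any other candidate could be plugged in."""
    return x[:, 0]

# Empirical risk under squared-error loss: (1/n) * sum (Y_i - f(X_i))^2
squared_risk = np.mean((Y - f(X)) ** 2)

# Empirical risk under 0-1 loss: (1/n) * sum I(Y_i != sign(f(X_i)))
zero_one_risk = np.mean(Y != np.sign(f(X)))

print(squared_risk, zero_one_risk)
```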
Potential challenges
What is a good approximation of the expected loss function? Empirical risk is most commonly used, but there are alternatives. What class of candidate functions $f$ should we optimize over? How do we avoid overfitting? Will computation be feasible (finding a global minimizer, computational complexity)?
Least squares estimation
The empirical risk is $\sum_{i=1}^n (Y_i - f(X_i))^2$. The candidate $f(x)$ can come from a class of linear functions; a sieve space of basis functions (splines, wavelets, radial bases); or a fully nonparametric class (kernel estimation). Overfitting can be addressed using regularization: variable selection for linear models; penalized splines or shrinkage for sieve approximations; cross-validation for tuning parameter selection. Computation relies on convex optimization, with coordinate descent for large $p$.
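As one concrete instance of regularized least squares, the sketch below fits a ridge-penalized linear model in closed form on synthetic data; the penalty level lam is an illustrative choice that would normally be selected by cross-validation.

```python
import numpy as np

# Ridge-penalized least squares: minimize sum_i (Y_i - X_i^T beta)^2 + lam * ||beta||^2,
# which has the closed-form solution computed below.
rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
Y = X @ beta_true + rng.normal(size=n)

lam = 1.0
beta_hat = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)
print(beta_hat.round(2))
```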
Support Vector Machines
Consider a binary classification problem and use the labels $\{-1, 1\}$ for the two classes. We start from a simple classification rule that is a linear function of the feature variables $X$. The idea of SVM is to identify a hyperplane in the feature space that separates the classes as much as possible.
SVM illustration
Mathematical formulation of SVM
The goal is to find a hyperplane $\beta_0 + X^T\beta = 0$ such that $Y_i(\beta_0 + X_i^T\beta) > 0$ for all $i = 1, \ldots, n$. Furthermore, we wish to maximize the margin, denoted $M$. That is, we solve
$$\max_{\|\beta\| = 1} M \quad \text{subject to } Y_i(\beta_0 + X_i^T\beta) \ge M, \; i = 1, \ldots, n.$$
Equivalent optimization
It is equivalent to
$$\min_{\beta_0, \beta} \tfrac{1}{2}\|\beta\|^2 \quad \text{subject to } Y_i(\beta_0 + X_i^T\beta) \ge 1, \; i = 1, \ldots, n.$$
There are two difficulties in practice: the classes may not be separable, so no solution exists; or the classes may be separable, but the separation is nonlinear.
Extension to imperfectly separated data
For imperfect separation, $Y_i(\beta_0 + X_i^T\beta)$ may not be positive, i.e., the prediction is wrong. We should allow such misclassification but impose some penalty for wrong predictions. This can be done by introducing slack variables $\xi_1, \ldots, \xi_n$, one for each subject. Each $\xi_i \ge 0$ describes how far observation $i$ falls on the wrong side of its margin. However, we should restrict the total penalty from being too large.
SVM optimization
The optimization is
$$\max_{\|\beta\| = 1} M \quad \text{subject to } Y_i(\beta_0 + X_i^T\beta) \ge M(1 - \xi_i), \; i = 1, \ldots, n,$$
where $\xi_i \ge 0$ and $\sum_{i=1}^n \xi_i$ is bounded by a pre-specified constant. Equivalently,
$$\min_{\beta_0, \beta, \xi} \tfrac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n \xi_i \quad \text{subject to } Y_i(\beta_0 + X_i^T\beta) \ge 1 - \xi_i, \; \xi_i \ge 0, \; i = 1, \ldots, n,$$
where $C$ is a given constant (called the cost parameter). This is a convex minimization problem with linear constraints.
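A minimal sketch of this soft-margin primal, written with cvxpy on synthetic data; the cost value C and the data-generating mechanism are illustrative assumptions.

```python
import cvxpy as cp
import numpy as np

# Soft-margin linear SVM primal:
# minimize (1/2)||beta||^2 + C * sum(xi)
# subject to Y_i (beta0 + X_i^T beta) >= 1 - xi_i and xi_i >= 0.
rng = np.random.default_rng(2)
n, p = 60, 2
X = rng.normal(size=(n, p))
Y = np.where(X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=n) > 0, 1.0, -1.0)

C = 1.0
beta = cp.Variable(p)
beta0 = cp.Variable()
xi = cp.Variable(n, nonneg=True)

margins = cp.multiply(Y, X @ beta + beta0)
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(beta) + C * cp.sum(xi)),
    [margins >= 1 - xi],
)
problem.solve()
print(beta.value, beta0.value)
```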
Solve the SVM problem using duality
The Lagrange (primal) function is
$$L_P = \tfrac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n \xi_i - \sum_{i=1}^n \alpha_i\left[Y_i(\beta_0 + X_i^T\beta) - (1 - \xi_i)\right] - \sum_{i=1}^n \mu_i \xi_i,$$
where $\alpha_i \ge 0$, $\mu_i \ge 0$ are the Lagrange multipliers. Differentiating with respect to $\beta_0$, $\beta$ and $\xi_i$ gives
$$\beta = \sum_{i=1}^n \alpha_i Y_i X_i, \quad 0 = \sum_{i=1}^n \alpha_i Y_i, \quad \alpha_i = C - \mu_i, \; i = 1, \ldots, n.$$
Dual problem
After plugging $\beta$ into the primal function and using these equations, the dual objective function is
$$L_D = \sum_{i=1}^n \alpha_i - \tfrac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j Y_i Y_j X_i^T X_j.$$
The dual problem becomes $\max L_D$ subject to $0 \le \alpha_i \le C$, $i = 1, \ldots, n$, and $\sum_{i=1}^n \alpha_i Y_i = 0$. Furthermore, the KKT conditions give
$$\alpha_i\left[Y_i(X_i^T\beta + \beta_0) - (1 - \xi_i)\right] = 0, \quad \mu_i \xi_i = 0, \quad Y_i(X_i^T\beta + \beta_0) - (1 - \xi_i) \ge 0.$$
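The same problem can be attacked through the dual. The sketch below (reusing X, Y, and C from the primal sketch above) solves the dual quadratic program with cvxpy and recovers beta from the stationarity condition; the small ridge added to Q is only for numerical positive semidefiniteness.

```python
import cvxpy as cp
import numpy as np

# Dual SVM: maximize sum(alpha) - (1/2) alpha^T Q alpha with Q_ij = Y_i Y_j X_i^T X_j,
# subject to 0 <= alpha_i <= C and sum_i alpha_i Y_i = 0.
Q = np.outer(Y, Y) * (X @ X.T)
Q += 1e-8 * np.eye(len(Y))          # tiny ridge so the PSD check passes numerically

alpha = cp.Variable(len(Y))
dual = cp.Problem(
    cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, Q)),
    [alpha >= 0, alpha <= C, Y @ alpha == 0],
)
dual.solve()

# Recover beta from the stationarity condition beta = sum_i alpha_i Y_i X_i.
beta_dual = (alpha.value * Y) @ X
print(beta_dual)
```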
KKT conditions
On SVM optimization
Solving the dual problem is a simple convex quadratic programming problem (there are many solvers in standard packages). Since $\beta = \sum_{i=1}^n \alpha_i Y_i X_i$, the hyperplane is determined by the observations with $\alpha_i \neq 0$, called support vectors. Among the support vectors, some lie on the margin edges ($\xi_i = 0$) and the remainder have $\alpha_i = C$. Any support vector with $\xi_i = 0$ can be used to solve for $\beta_0$ (often taken to be the average if there are several). Sometimes $\beta_0$ can be obtained by directly minimizing the primal function.
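In practice one rarely codes the QP by hand; for instance, scikit-learn's SVC solves the dual and exposes the support vectors, the products alpha_i * Y_i, and the intercept beta_0. The data below are synthetic and the cost C is illustrative.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 2))
Y = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)

clf = SVC(kernel="linear", C=1.0).fit(X, Y)

# Support vectors are the observations with alpha_i != 0.
print("number of support vectors:", len(clf.support_))
# dual_coef_ stores alpha_i * Y_i for the support vectors; intercept_ is beta_0.
beta = clf.dual_coef_ @ clf.support_vectors_
print("beta:", beta.ravel(), "beta0:", clf.intercept_)
```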
Illustrative example
Going beyond linear SVM
The most commonly used nonlinear prediction rule restricts $f$ to an RKHS $\mathcal{H}_K$ (the kernel trick). Recall that an RKHS is given by a kernel function $K(x, y)$ with eigen-expansion
$$K(x, y) = \sum_{k=1}^\infty \gamma_k \phi_k(x)\phi_k(y),$$
where $\sqrt{\gamma_k}\,\phi_k$ is the normalized basis function for $\{\mathcal{H}_K, \langle\cdot,\cdot\rangle_{\mathcal{H}_K}\}$. We can represent $f(x)$ using these basis functions:
$$f(x) = \beta_0 + \sum_{k=1}^\infty \beta_k \sqrt{\gamma_k}\,\phi_k(x).$$
Dual problem with the kernel trick
Following the same derivation as for the linear SVM (replace $X_i$ by the vector $(\sqrt{\gamma_1}\,\phi_1(X_i), \sqrt{\gamma_2}\,\phi_2(X_i), \ldots)^T$), the dual objective function becomes
$$\sum_{i=1}^n \alpha_i - \tfrac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j Y_i Y_j \Big(\sum_{k=1}^\infty \gamma_k \phi_k(X_i)\phi_k(X_j)\Big) = \sum_{i=1}^n \alpha_i - \tfrac{1}{2}\sum_{i=1}^n\sum_{j=1}^n \alpha_i \alpha_j Y_i Y_j K(X_i, X_j).$$
The prediction function becomes
$$f(x) = \beta_0 + \sum_{i=1}^n \alpha_i Y_i \sum_{k=1}^\infty \gamma_k \phi_k(X_i)\phi_k(x) = \beta_0 + \sum_{i=1}^n \alpha_i Y_i K(X_i, x).$$
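A small check of the kernel representation: with a Gaussian kernel, the fitted decision function can be reproduced from the support vectors and kernel values alone, matching f(x) = beta_0 + sum_i alpha_i Y_i K(X_i, x). The data and the value of gamma below are illustrative.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 2))
Y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)   # nonlinear boundary

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, Y)

x_new = rng.normal(size=(5, 2))
K = rbf_kernel(clf.support_vectors_, x_new, gamma=gamma)   # K(X_i, x) for the support vectors
f_manual = clf.dual_coef_ @ K + clf.intercept_             # sum_i alpha_i Y_i K(X_i, x) + beta_0
print(np.allclose(f_manual.ravel(), clf.decision_function(x_new)))
```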
Advantages of the kernel trick
Our conclusions are: (a) restricting $f$ to $\mathcal{H}_K$ leads to a nonlinear prediction function that depends on the kernel function; (b) solving the dual problem for the prediction function only requires knowing the kernel function $K(x, y)$ (not necessarily the basis functions); (c) the optimization in the dual problem depends on the number of observations ($n$) but not on the dimensionality of the $X_i$'s.
Choice of kernel functions
Polynomial kernel: $K(x, x') = (1 + x^T x')^d$. Radial basis or Gaussian kernel: $K(x, x') = \exp\{-\gamma \|x - x'\|^2\}$. Neural network (sigmoid) kernel: $K(x, x') = \tanh(\kappa_1 x^T x' + \kappa_2)$.
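For reference, the three kernels can be written directly in numpy; the tuning parameters d, gamma, k1, and k2 below are illustrative choices.

```python
import numpy as np

def polynomial_kernel(x, z, d=2):
    return (1.0 + x @ z) ** d

def gaussian_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def sigmoid_kernel(x, z, k1=0.01, k2=0.0):
    return np.tanh(k1 * (x @ z) + k2)

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, z), gaussian_kernel(x, z), sigmoid_kernel(x, z))
```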
Revisit SVM example
Loss formulation for SVM
Revisit the linear SVM formulation: we minimize $\|\beta\|$ subject to the separation constraints $Y_i(\beta_0 + X_i^T\beta) \ge 1 - \xi_i$, $\xi_i \ge 0$, $i = 1, \ldots, n$, with $\sum_{i=1}^n \xi_i$ controlled by a constant. We need to understand exactly which empirical loss this optimization minimizes because, having done so, we can characterize whether SVM minimizes the classification loss (Fisher consistency), and we can study the stochastic variability of the SVM classifier (convergence rates and risk bounds).
Loss formulation: continued
Equivalently, we minimize (for a given constant $C$)
$$\tfrac{1}{2}\|\beta\|^2 + C\sum_{i=1}^n \xi_i \quad \text{subject to } \xi_i \ge [1 - Y_i(\beta_0 + X_i^T\beta)]_+,$$
where $(1 - z)_+ = \max(1 - z, 0)$. Hence, SVM is equivalent to minimizing the loss
$$\sum_{i=1}^n [1 - Y_i(\beta_0 + X_i^T\beta)]_+ + \frac{\lambda}{2}\|\beta\|^2.$$
For nonlinear SVM, the loss is $\sum_{i=1}^n [1 - Y_i f(X_i)]_+ + \frac{\lambda}{2}\|f\|^2_{\mathcal{H}_K}$. We call $L(y, f) = [1 - yf]_+$ the hinge loss function.
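The hinge-loss-plus-ridge form can also be minimized directly. The sketch below runs plain subgradient descent on the penalized hinge objective for a linear rule, with synthetic data and an untuned step size, so it is an illustration rather than a production solver.

```python
import numpy as np

# Minimize sum_i [1 - Y_i(beta_0 + X_i^T beta)]_+ + (lambda/2) ||beta||^2
# by subgradient descent.
rng = np.random.default_rng(5)
n, p = 200, 2
X = rng.normal(size=(n, p))
Y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

lam, step, n_iter = 1.0, 1e-3, 5000
beta, beta0 = np.zeros(p), 0.0
for _ in range(n_iter):
    margins = Y * (X @ beta + beta0)
    active = margins < 1                      # observations with positive hinge loss
    grad_beta = -(Y[active, None] * X[active]).sum(axis=0) + lam * beta
    grad_beta0 = -Y[active].sum()
    beta -= step * grad_beta
    beta0 -= step * grad_beta0

objective = np.maximum(1 - Y * (X @ beta + beta0), 0).sum() + 0.5 * lam * beta @ beta
print(beta, beta0, objective)
```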
Plot of the hinge loss
Fisher consistency of SVM
Fisher consistency: suppose $f^*$ minimizes $E[(1 - Yf(X))_+]$. Then $\mathrm{sign}(f^*(x))$ is the Bayes rule for the classification problem. Proof: note that
$$E[(1 - Yf(X))_+ \mid X = x] = (1 - f(x))_+ P(Y = 1 \mid X = x) + (1 + f(x))_+ P(Y = -1 \mid X = x),$$
as a function of $f(x)$, is piecewise linear with three pieces: decreasing on $(-\infty, -1]$, linear on $(-1, 1]$, and increasing on $[1, \infty)$. The minimum is attained at $f(x) = -1$ if $P(Y = 1 \mid X = x) < P(Y = -1 \mid X = x)$ and at $f(x) = 1$ otherwise. Hence $\mathrm{sign}(f^*(x))$ agrees with the Bayes rule, and we conclude Fisher consistency.
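The argument can be checked numerically: for each conditional probability p = P(Y = 1 | X = x) on a small grid, minimize the conditional hinge risk over f and compare the sign of the minimizer with the Bayes rule.

```python
import numpy as np

# Conditional hinge risk: (1 - f)_+ * p + (1 + f)_+ * (1 - p), minimized over f.
f_grid = np.linspace(-3, 3, 601)
for p in [0.1, 0.3, 0.7, 0.9]:
    risk = np.maximum(1 - f_grid, 0) * p + np.maximum(1 + f_grid, 0) * (1 - p)
    f_star = f_grid[np.argmin(risk)]
    bayes = 1 if p > 0.5 else -1
    print(p, f_star, np.sign(f_star) == bayes)
```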
Extensions of the hinge loss
The hinge loss is a special case of the so-called large-margin losses, which have the form $\phi(yf)$ for some convex function $\phi$. Additional examples include: binomial deviance, $\log(1 + e^{-yf})$; squared loss, $(1 - yf)^2$; squared hinge loss, $[(1 - yf)_+]^2$; AdaBoost (exponential loss), $\exp\{-yf\}$. A sufficient condition for Fisher consistency is that $\phi$ is differentiable at $0$ with $\phi'(0) < 0$.
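For comparison, these large-margin losses can all be written as functions of the margin z = yf; the sketch below simply evaluates them on a few margin values to show their common shape.

```python
import numpy as np

# Each loss is a convex, decreasing-at-zero function of the margin z = y * f.
losses = {
    "hinge":          lambda z: np.maximum(1 - z, 0),
    "binomial dev.":  lambda z: np.log(1 + np.exp(-z)),
    "squared":        lambda z: (1 - z) ** 2,
    "squared hinge":  lambda z: np.maximum(1 - z, 0) ** 2,
    "exponential":    lambda z: np.exp(-z),
}
z = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
for name, phi in losses.items():
    print(f"{name:15s}", np.round(phi(z), 3))
```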
SVM for regression
The extension of SVM to continuous $Y$ is based on modifying the SVM loss. Consider the prediction $f(X)$ for a subject with features $X$ whose true outcome is $Y$. The inaccuracy of the prediction can be characterized by the so-called $\epsilon$-insensitive loss: $L(Y, f(X)) = \max(|Y - f(X)| - \epsilon, 0)$. The loss is zero if the prediction error is within $\epsilon$.
ɛ-insensitive loss
Optimization problem in SVM for regression
The objective function for linear prediction is
$$\min_{\beta_0, \beta, \xi, \xi^*} \; \|\beta\|^2/2 + C\sum_{i=1}^n \{\xi_i + \xi_i^*\} \quad \text{subject to } -\xi_i^* - \epsilon \le Y_i - (\beta_0 + X_i^T\beta) \le \epsilon + \xi_i, \; \xi_i \ge 0, \; \xi_i^* \ge 0, \; i = 1, \ldots, n.$$
The dual problem is
$$\min_{\alpha, \alpha^*} \; \epsilon\sum_{i=1}^n (\alpha_i + \alpha_i^*) - \sum_{i=1}^n Y_i(\alpha_i - \alpha_i^*) + \tfrac{1}{2}\sum_{i=1}^n\sum_{j=1}^n (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) X_i^T X_j$$
subject to $0 \le \alpha_i, \alpha_i^* \le C$, $\sum_{i=1}^n (\alpha_i - \alpha_i^*) = 0$, and $\alpha_i \alpha_i^* = 0$. The prediction function is $\beta_0 + X^T\beta$ with $\beta = \sum_{i=1}^n (\alpha_i - \alpha_i^*) X_i$.
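In practice, epsilon-insensitive regression is available in standard software; for example, scikit-learn's SVR solves this dual. The sketch below uses a linear kernel on synthetic data, with illustrative values of C and epsilon, and reads off the fitted coefficients.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(6)
X = rng.uniform(-3, 3, size=(200, 1))
Y = 1.5 * X.ravel() + 0.2 * rng.normal(size=200)

reg = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, Y)
# With a linear kernel, coef_ plays the role of beta in the prediction beta_0 + X^T beta.
print("beta:", reg.coef_.ravel(), "beta0:", reg.intercept_,
      "number of support vectors:", len(reg.support_))
```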