Introduction to SVM and RVM Machine Learning Seminar HUS HVL UIB Yushu Li, UIB
Overview
Support vector machine (SVM)
  First introduced by Vapnik et al., 1992
  Extensive literature and wide applications
Relevance vector machine (RVM)
  Introduced by Tipping, M.E., 2001
  Few documents, lots of potential research topics
Both are kernel-based supervised learning methods
  The kernel is a key concept in machine learning
Both use a few key data points for classification/regression
  SVM: few support vectors
  RVM: few relevance vectors
Support vector machine (SVM)
Supervised learning: classification and regression
Advantages of the support vector machine
  Can always find a globally optimal solution
  Uses a few support vectors instead of the whole dataset
  Kernel-based mapping trick to deal with non-linear boundaries
  Purely data driven, no need for a priori assumptions about model structure
Outline for the SVM part
Maximal Margin Classifier (MMC)
  Global solution
  Support vectors
Support Vector Classifier (SVC)
  Slack variables
Support Vector Machine (SVM)
  Enlarged feature space
  Kernel trick
Maximal Margin Classifier (MMC)
In MMC the training data are assumed to be perfectly linearly separable.
The input contains p covariates (p independent variables), $X = (X_1, X_2, \dots, X_p)^T$; the output Y belongs to one of two classes, -1 or +1.
Training dataset of N observations $\{(y_1, x_1), \dots, (y_N, x_N)\}$: the ith observation is $x_i \in \mathbb{R}^p$, $x_i = (x_{i1}, \dots, x_{ip})^T$, with output $y_i \in \{-1, 1\}$, $i = 1, 2, \dots, N$.
Let $f(X) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$. We need to construct a linear separating hyperplane $f(X) = 0$ as the classifier.
For new input data $x_0 \in \mathbb{R}^p$: if $f(x_0) > 0$, predict class +1, else class -1.
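The prediction rule can be sketched in a few lines of plain Python. The coefficient values below are hypothetical and chosen only for illustration; they are not fitted to any data from the seminar.

```python
def decision_value(x, beta, beta0):
    """f(x) = beta0 + beta_1 * x_1 + ... + beta_p * x_p."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def predict(x, beta, beta0):
    """Predict class +1 if f(x) > 0, else class -1."""
    return 1 if decision_value(x, beta, beta0) > 0 else -1


# Hypothetical coefficients for p = 2 (illustration only, not fitted).
beta, beta0 = [0.5, 0.5], 0.0

print(predict([2.0, 0.0], beta, beta0))    # point on the positive side
print(predict([-1.0, -3.0], beta, beta0))  # point on the negative side
```

A point lying exactly on the hyperplane (f = 0) is assigned to class -1 here, following the "else" branch of the rule.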
Maximal margin hyperplane (MMH) (e.g. p = 2)
[Figure: candidate separating hyperplanes for p = 2, axes X_1 and X_2. Green points: y = -1, red points: y = 1.]
Margin M: distance from the hyperplane to the nearest training point.
The intuition is that the separating hyperplane should be as far away from the data of both classes as possible: the larger M, the better the hyperplane.
The MMH is the hyperplane with the largest M (graph D).
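Optimization problem
[Figure: the two dashed lines are the margin lines.]
To find the coefficients $\beta_0$, $\beta = (\beta_1, \dots, \beta_p)^T$ of the MMH, we need to solve a maximization problem (maximize the margin M). It can be rewritten as the following minimization problem:
$$\min_{\beta, \beta_0} \; \frac{1}{2}\|\beta\|^2 \quad \text{subject to} \quad y_i (x_i^T \beta + \beta_0) - 1 \ge 0, \quad i = 1, \dots, N$$
This is a convex quadratic problem, so a global solution can be found! (This is the reason why we want to construct a linear hyperplane.)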
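To make the constraints and the objective concrete, the sketch below checks whether a candidate hyperplane satisfies y_i (x_i^T β + β_0) - 1 ≥ 0 for every point and evaluates the objective. The toy dataset and the candidate coefficients are assumptions made for illustration; the margin formula 1/||β|| is the standard geometric fact for this scaling of the constraints.

```python
import math

# Toy, linearly separable data (assumed for illustration).
X = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, 0.0)]
y = [1, 1, -1, -1]


def is_feasible(beta, beta0, X, y, tol=1e-12):
    """True if y_i * (x_i^T beta + beta0) - 1 >= 0 for every observation."""
    return all(
        yi * (sum(b * xj for b, xj in zip(beta, xi)) + beta0) - 1 >= -tol
        for xi, yi in zip(X, y)
    )


def objective(beta):
    """The minimized quantity: 0.5 * ||beta||^2."""
    return 0.5 * sum(b * b for b in beta)


def margin(beta):
    """Distance from the hyperplane to the margin lines: 1 / ||beta||."""
    return 1.0 / math.sqrt(sum(b * b for b in beta))


for beta in ([0.5, 0.5], [1.0, 1.0], [0.25, 0.25]):
    print(beta, is_feasible(beta, 0.0, X, y), objective(beta), margin(beta))
```

Among the feasible candidates, the one with the smaller objective has the larger margin, which is why minimizing ||β||² is equivalent to maximizing M.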
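Transform to dual problem
The Lagrangian here is:
$$L(\alpha, \beta, \beta_0) = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (x_i^T \beta + \beta_0) - 1 \right]$$
where $\alpha = (\alpha_1, \dots, \alpha_N)^T$ are Lagrange multipliers.
The previous minimization problem is equivalent to maximizing the following $L_D(\alpha)$ with respect to $\alpha = (\alpha_1, \dots, \alpha_N)^T$, subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$ (dual problem):
$$L_D(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,k=1}^{N} \alpha_i \alpha_k y_i y_k x_i^T x_k$$
The optimal solution for $\alpha = (\alpha_1, \dots, \alpha_N)^T$ leads to the optimal solution for $\beta_0$, $\beta = (\beta_1, \dots, \beta_p)^T$.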
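As a small numerical check of the dual objective, the sketch below evaluates L_D(α) on a two-point toy problem and compares it with the primal objective. The data and the multipliers (α_1 = α_2 = 0.25) are assumptions, worked out by hand for this toy set only.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, X, y):
    """L_D(alpha) = sum_i alpha_i - 0.5 * sum_{i,k} alpha_i alpha_k y_i y_k x_i^T x_k."""
    n = len(alpha)
    quad = sum(
        alpha[i] * alpha[k] * y[i] * y[k] * dot(X[i], X[k])
        for i in range(n)
        for k in range(n)
    )
    return sum(alpha) - 0.5 * quad


def beta_from_alpha(alpha, X, y):
    """Recover beta = sum_i alpha_i y_i x_i."""
    p = len(X[0])
    return [sum(a * yi * xi[j] for a, yi, xi in zip(alpha, y, X)) for j in range(p)]


# Two-point toy problem; multipliers derived by hand (assumption).
X = [(1.0, 1.0), (-1.0, -1.0)]
y = [1, -1]
alpha = [0.25, 0.25]

beta = beta_from_alpha(alpha, X, y)
primal_value = 0.5 * dot(beta, beta)
print(beta, dual_objective(alpha, X, y), primal_value)
```

For these values the dual and primal objectives coincide, which is what one expects at the optimum of a convex problem of this kind.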
Inner product $\langle x_i, x_j \rangle$
Question: why do we want to solve the dual problem instead of the original optimization problem?
Answer:
a. It helps us identify the support vectors.
b. The inner product $\langle x_i, x_j \rangle = x_i^T x_j$ appears, which will be an important element later in non-linearly separable classification problems.
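Solve the dual problem
The Lagrangian:
$$L(\alpha, \beta, \beta_0) = \frac{1}{2}\|\beta\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (x_i^T \beta + \beta_0) - 1 \right]$$
Step 1: fix $\alpha$ and minimize $L(\alpha, \beta, \beta_0)$ with respect to $\beta$ and $\beta_0$: set $\partial L / \partial \beta = 0$ and $\partial L / \partial \beta_0 = 0$ to get
$$\beta = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad 0 = \sum_{i=1}^{N} \alpha_i y_i \qquad \text{(I)}$$
Step 2: substitute the two equations in (I) back into L to get:
$$L_D(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,k=1}^{N} \alpha_i \alpha_k y_i y_k x_i^T x_k$$
Step 3: maximize $L_D(\alpha)$ with respect to $\alpha$, subject to $\alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$.
Step 3 can be solved (global solution) by the sequential minimal optimization (SMO) algorithm.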
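The substitution in Step 2 can be written out term by term. The following is a worked sketch using only the two conditions in (I); note that with $\beta = \sum_i \alpha_i y_i x_i$ we have $\beta^T\beta = \sum_{i,k} \alpha_i \alpha_k y_i y_k x_i^T x_k$.

```latex
\begin{aligned}
L(\alpha, \beta, \beta_0)
  &= \tfrac{1}{2}\beta^T\beta
     - \sum_{i=1}^{N} \alpha_i y_i x_i^T \beta
     - \beta_0 \sum_{i=1}^{N} \alpha_i y_i
     + \sum_{i=1}^{N} \alpha_i \\
  &= \tfrac{1}{2}\beta^T\beta - \beta^T\beta - \beta_0 \cdot 0 + \sum_{i=1}^{N} \alpha_i \\
  &= \sum_{i=1}^{N} \alpha_i
     - \tfrac{1}{2} \sum_{i,k=1}^{N} \alpha_i \alpha_k y_i y_k x_i^T x_k
   = L_D(\alpha)
\end{aligned}
```

The second line uses $\sum_i \alpha_i y_i x_i^T \beta = \beta^T \beta$ and $\sum_i \alpha_i y_i = 0$, so $\beta$ and $\beta_0$ drop out and only $\alpha$ remains.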
Support vectors (I)
Once the dual problem is solved, we have for each $\alpha_i$, $i = 1, \dots, N$:
If $\alpha_i > 0$, the corresponding $x_i$ lies exactly on one of the margin lines (dashed lines); these points are the support vectors.
If $x_i$ lies strictly outside the margin lines, then $\alpha_i = 0$.
Support vectors (II)
The original coefficient $\beta = \sum_{i=1}^{N} \alpha_i y_i x_i$ is then
$$\beta = \sum_{l=1}^{L} \alpha_l y_l x_l$$
where the $x_l$ are the support vectors and L is much smaller than N!
For new input data $x_0$:
$$f(x_0) = \beta^T x_0 + \beta_0 = \sum_{l=1}^{L} \alpha_l y_l x_l^T x_0 + \beta_0$$
If $f(x_0) > 0$, predict output 1, else -1.
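The reduction from N terms to L terms can be illustrated as follows. This is a minimal sketch on assumed toy data; the multipliers were derived by hand for this dataset (two points on the margin lines, two strictly outside with multiplier zero).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy training set (assumed); alpha solved by hand for this set.
X = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-3.0, -1.0)]
y = [1, 1, -1, -1]
alpha = [0.25, 0.0, 0.25, 0.0]
beta0 = 0.0


def f_full(x0):
    """Sum over all N training points."""
    return sum(a * yi * dot(xi, x0) for a, yi, xi in zip(alpha, y, X)) + beta0


# Keep only the support vectors: points with alpha_l > 0.
support = [(a, yi, xi) for a, yi, xi in zip(alpha, y, X) if a > 0]


def f_sv(x0):
    """Sum over the L support vectors only."""
    return sum(a * yl * dot(xl, x0) for a, yl, xl in support) + beta0


x0 = (2.0, 0.0)
print(len(X), len(support), f_full(x0), f_sv(x0))
```

Both sums give the same decision value, but the second one touches only the support vectors, which is where the computational saving at prediction time comes from.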
Advantages and disadvantages of MMC
Advantages:
  maximal margin between the two classes
  global solution
  uses few support vectors for prediction
Disadvantages:
  hard margin: all training points must lie on or outside the margin lines
  cannot separate noisy data (left graph)
  not robust to outliers (right graph)
Soft margin: Support Vector Classifier (SVC)
Support Vector Classifier (SVC)
Soft margin in SVC
A soft margin allows certain points to:
  lie on the incorrect side of the margin lines (green ξ*_4 and red ξ*_1, ξ*_2);
  lie on the wrong side of the separating hyperplane (green ξ*_3 and red ξ*_5).
At most D points are allowed on the wrong side of the hyperplane.
Optimization problem for SVC
Constructing the SVC hyperplane amounts to solving an optimization problem in which $\varepsilon_1, \dots, \varepsilon_N$ are slack variables:
  $\varepsilon_i = 0$: the ith observation is on the correct side of the margin.
  $\varepsilon_i > 0$: the ith observation is on the wrong side of the margin.
  $\varepsilon_i > 1$: the ith observation lies on the wrong side of the hyperplane.
The maximal allowed number of such points is D, a hyperparameter chosen by cross-validation (CV).
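The optimization problem itself did not survive on this slide. One standard soft-margin formulation that is consistent with the MMC problem earlier in the deck and with the slack variables and budget D described here would be the following; treat it as an assumption about what the slide showed, not as a reconstruction of it.

```latex
\begin{aligned}
\min_{\beta,\, \beta_0,\, \varepsilon} \quad & \frac{1}{2}\|\beta\|^2 \\
\text{subject to} \quad & y_i (x_i^T \beta + \beta_0) \ge 1 - \varepsilon_i, \quad i = 1, \dots, N, \\
& \varepsilon_i \ge 0, \qquad \sum_{i=1}^{N} \varepsilon_i \le D
\end{aligned}
```

Under this form, every observation on the wrong side of the hyperplane has $\varepsilon_i > 1$, so the budget $\sum_i \varepsilon_i \le D$ implies that at most D observations can be misclassified on the training set.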
Support vectors in SVC
Only if the ith observation lies on or violates the margin line is $\alpha_i > 0$. Those observations are the support vectors; otherwise $\alpha_i = 0$.
For new input $x_0$, if $f(x_0) > 0$, predict output 1, else -1:
$$f(x_0) = \sum_{i=1}^{N} \alpha_i y_i x_i^T x_0 + \beta_0 = \sum_{l=1}^{L} \alpha_l y_l x_l^T x_0 + \beta_0$$
The $x_l$ are the support vectors and L is much smaller than N.
Support vectors in SVC Especially suitable for large data set.
Linear boundary cannot work
Disadvantage of SVC: sometimes a linear hyperplane just won't work in the original p-dimensional input space (left, p = 2).
We can map the original data to a higher, m-dimensional feature space where the data are separable by a linear hyperplane (right, m = 3).
Mapping trick: SVM
Support Vector Machine
Basic idea of SVM (I)
Transform (map) the data $X = (X_1, X_2, \dots, X_p)^T$ from the original p-dimensional input space into the m-dimensional enlarged feature space:
$$X \rightarrow h(X), \quad \text{where } h(X) = (h_1(X), h_2(X), \dots, h_m(X))^T \text{ with } m > p.$$
Then find the unique (global solution) optimal hyperplane in the m-dimensional space.
Example: p = 2, m = 3 with $h_1(X) = X_1^2$, $h_2(X) = \sqrt{2}\, X_1 X_2$, $h_3(X) = X_2^2$.
Inner products
After the transformation, the ith input in the enlarged feature space becomes an m-dimensional vector: $h(x_i) = (h_1(x_i), h_2(x_i), \dots, h_m(x_i))^T$.
The dual problem is then:
$$L_D(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,k=1}^{N} \alpha_i \alpha_k y_i y_k h(x_i)^T h(x_k)$$
For new input data $x_0$:
$$f(x_0) = \beta^T h(x_0) + \beta_0 = \sum_{l=1}^{L} \alpha_l y_l h(x_l)^T h(x_0) + \beta_0 \quad (L \text{ much smaller than } N)$$
If $f(x_0) > 0$, the predicted output is 1, else -1.
Transformation $h(x_i) = (h_1(x_i), h_2(x_i), \dots, h_m(x_i))^T$
Instead of computing $x_i^T x_j$, we need to compute $h(x_i)^T h(x_j)$.
m can be very high, and the computation of $h(x_i)^T h(x_j)$ can become intractable.
In SVM, we don't actually choose the transformation basis functions $h(x) = (h_1(x), h_2(x), \dots, h_m(x))^T$.
Instead we choose a kernel function K such that $K(x_i, x_j) = h(x_i)^T h(x_j)$.
The Kernel Trick
A kernel function is a function that corresponds to an inner product in some enlarged feature space.
E.g. p = 2: $x = (x_1, x_2)^T$. Suppose the transformation basis is
$$h(x) = (1,\; x_1^2,\; \sqrt{2}\, x_1 x_2,\; x_2^2,\; \sqrt{2}\, x_1,\; \sqrt{2}\, x_2)^T, \quad \text{thus } m = 6.$$
For $x_i = (x_{i1}, x_{i2})^T$ and $x_j = (x_{j1}, x_{j2})^T$, the inner product $h(x_i)^T h(x_j)$ is:
$$\begin{aligned}
h(x_i)^T h(x_j) &= 1 + x_{i1}^2 x_{j1}^2 + 2 x_{i1} x_{j1} x_{i2} x_{j2} + x_{i2}^2 x_{j2}^2 + 2 x_{i1} x_{j1} + 2 x_{i2} x_{j2} \\
&= (1 + x_i^T x_j)^2
\end{aligned}$$
If we define a kernel function $K(x_i, x_j) = (1 + x_i^T x_j)^2$, then there is no need to compute $h(x_i)^T h(x_j)$ explicitly.
Examples of kernel functions
Linear: $K(x_i, x_j) = x_i^T x_j$
Polynomial of power d: $K(x_i, x_j) = (1 + x_i^T x_j)^d$
Gaussian (radial-basis function network): $K(x_i, x_j) = \exp\left(-\dfrac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
Sigmoid: $K(x_i, x_j) = \tanh(\beta_0 x_i^T x_j + \beta_1)$
Note: every kernel function K(·) must satisfy Mercer's condition.
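The four kernels listed above can be written directly as plain Python functions. This is a minimal sketch; the parameter names (`d`, `sigma`, `b0`, `b1`) and their default values are assumptions made here for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(xi, xj):
    return dot(xi, xj)


def polynomial_kernel(xi, xj, d=2):
    return (1.0 + dot(xi, xj)) ** d


def gaussian_kernel(xi, xj, sigma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def sigmoid_kernel(xi, xj, b0=1.0, b1=0.0):
    return math.tanh(b0 * dot(xi, xj) + b1)


u, v = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(u, v), polynomial_kernel(u, v), gaussian_kernel(u, v), sigmoid_kernel(u, v))
```

Each function takes two input vectors of equal length and returns a scalar, so any of them can be passed wherever an inner product $x_i^T x_j$ is needed.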
Kernel trick
By using kernel functions, the dual problem changes from:
$$L_D(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,k=1}^{N} \alpha_i \alpha_k y_i y_k h(x_i)^T h(x_k)$$
to:
$$L_D(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i,k=1}^{N} \alpha_i \alpha_k y_i y_k K(x_i, x_k)$$
For new input data $x_0$:
$$f(x_0) = \beta^T h(x_0) + \beta_0 = \sum_{l=1}^{L} \alpha_l y_l K(x_l, x_0) + \beta_0$$
If $f(x_0) > 0$, the predicted output is 1, else -1.
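A kernelized decision function can be illustrated on the classic XOR-style problem, which no linear hyperplane in the original 2-D space can separate. The data, the degree-2 polynomial kernel, and the multipliers (all equal to 1/8, with β_0 = 0) are assumptions for this toy case; the multipliers were derived by hand and are not the output of a solver.

```python
def poly2_kernel(u, v):
    """K(u, v) = (1 + u^T v)^2."""
    return (1.0 + sum(a * b for a, b in zip(u, v))) ** 2


# XOR-style data: not linearly separable in the original input space.
X = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
y = [-1, 1, 1, -1]
alpha = [0.125, 0.125, 0.125, 0.125]  # hand-derived for this toy problem
beta0 = 0.0


def decision_value(x0):
    """f(x0) = sum_l alpha_l * y_l * K(x_l, x0) + beta0."""
    return sum(a * yl * poly2_kernel(xl, x0) for a, yl, xl in zip(alpha, y, X)) + beta0


def predict(x0):
    return 1 if decision_value(x0) > 0 else -1


for xi, yi in zip(X, y):
    print(xi, yi, decision_value(xi))
print(predict((0.5, -0.5)), predict((0.5, 0.5)))
```

In this example all four training points sit exactly on the margin (their decision values equal their labels), so all four are support vectors, and the feature map h is never formed explicitly.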
Multi-class classification
One versus the rest: train one classifier per class, with all the other classes serving as the non-class training samples
Hierarchical trees
One vs. one
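A one-versus-the-rest prediction step might look like the sketch below. It assumes that one binary decision function per class has already been trained and that the class with the largest decision value wins; that tie-breaking rule, the class labels, and the linear scorers are all hypothetical and used only for illustration.

```python
def linear_scorer(beta, beta0):
    """Return a decision function f(x) = beta0 + beta^T x (hypothetical, not fitted)."""
    return lambda x: beta0 + sum(b * xj for b, xj in zip(beta, x))


# One "class versus the rest" scorer per class (illustrative coefficients).
scorers = {
    "A": linear_scorer([1.0, 0.0], 0.0),
    "B": linear_scorer([0.0, 1.0], 0.0),
    "C": linear_scorer([-1.0, -1.0], 0.0),
}


def one_vs_rest_predict(x, scorers):
    """Pick the class whose 'this class vs the rest' decision value is largest."""
    return max(scorers, key=lambda c: scorers[c](x))


print(one_vs_rest_predict((2.0, 0.5), scorers))
print(one_vs_rest_predict((0.1, 3.0), scorers))
print(one_vs_rest_predict((-1.0, -1.0), scorers))
```

With K classes this scheme trains K binary classifiers, whereas one-vs-one trains one classifier per pair of classes.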
SVM regression example Pure data driven No need for assumption of regression model Still use only few support vectors instead of whole data set
Summary of SVM Advantages: Global optimization solution, few support vectors for prediction, fast computation speed, pure data driven, avoid overfitting... Applications:
Disadvantages of SVM
Predictions are not probabilistic.
The number of support vectors grows steeply with the size of the training set.
Cross-validation is needed to choose the hyperparameters D and ε (ε is a hyperparameter in SVR).
The kernel function K(·) must satisfy Mercer's condition.
Relevance vector machine
http://www.miketipping.com/sparsebayes.htm
Training dataset $T = \{(y_1, x_1), \dots, (y_N, x_N)\}$
Assume $p(y \mid x, w)$ follows a Gaussian distribution $N(f(x), \sigma^2)$, with
$$f(x) = \sum_{i=1}^{N} w_i K(x_i, x) + \beta_0 \qquad \text{(II)}$$
where $w = (w_1, \dots, w_N)^T$ is the weight vector.
Based on (II), we can write down the likelihood of the dataset, $p(T \mid w)$. Given a specific prior distribution $p(w)$, RVM calculates the posterior probability $p(w \mid T)$ from Bayes' rule.
RVM
For many weights, the posterior $p(w_i \mid T)$ becomes infinitely peaked at zero, and the corresponding ith kernel functions can be pruned away.
The remaining L (L much smaller than N) nonzero weights correspond to training data points, which are called relevance vectors.
For new input $x_0$, use the predictive distribution $p(y_0 \mid x_0, T)$ for further inference/prediction of $y_0$:
$$p(y_0 \mid x_0, T) = \int p(y_0 \mid x_0, w)\, p(w \mid T)\, dw$$
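The slides leave the prior unspecified. As a hedged sketch, the formulation in Tipping (2001) uses an independent zero-mean Gaussian prior per weight, under which the posterior and the predictive distribution have closed forms for fixed hyperparameters. The notation below is an assumption made here: $\gamma_i$ denotes the prior precision of $w_i$ (Tipping writes $\alpha_i$; it is renamed to avoid a clash with the SVM multipliers), $\Phi$ is the $N \times (N+1)$ design matrix with entries $K(x_n, x_i)$ plus a column of ones for the bias, $\phi(x_0)$ is the corresponding vector for a new input, and $y$ is the vector of training outputs.

```latex
\begin{aligned}
p(w \mid \gamma) &= \prod_{i} N\!\left(w_i \mid 0, \gamma_i^{-1}\right) \\
p(w \mid T) &= N(w \mid \mu, \Sigma), \qquad
  \Sigma = \left(\sigma^{-2} \Phi^T \Phi + \operatorname{diag}(\gamma)\right)^{-1}, \qquad
  \mu = \sigma^{-2}\, \Sigma\, \Phi^T y \\
p(y_0 \mid x_0, T) &= N\!\left(y_0 \mid \mu^T \phi(x_0),\; \sigma^2 + \phi(x_0)^T \Sigma\, \phi(x_0)\right)
\end{aligned}
```

In this setting, a weight whose precision $\gamma_i$ grows without bound during learning has a posterior concentrated at zero, which corresponds to the pruning of kernel functions described on this slide.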
RVM vs SVM
RVM gives a probabilistic prediction $p(y_0 \mid x_0, T)$; SVM gives only a point prediction $y_0$.
The number of relevance vectors can be much smaller than the number of support vectors.
RVM does not need the tuning of a hyperparameter during the training phase, as SVM does.
The kernel function K(·) does not need to satisfy Mercer's condition.
RVM regression vs SVM regression
An example with data-generating process (DGP): y = sinc(x) + N(0, sd), where sd = 0.1.
RVR uses only 7 relevance vectors, while SVR uses 29 support vectors. RVR has lower error.
SVR has to use CV to choose the hyperparameters C and ε, while RVR estimates its hyperparameters automatically during the learning procedure.
However, the training phase of RVM typically involves a highly nonlinear optimization process. (It may only find a local optimum.)