Support Vector Machine
Natural Language Processing Lab
lizhonghua
Support Vector Machine
Introduction
Theory
SVM primal and dual problem
Parameter selection and practical issues
Comparison to other classifiers
Conclusion and discussion
Introduction
Some notation: the training data $(x_1, y_1), \dots, (x_m, y_m) \in \mathcal{X} \times \{\pm 1\}$ are generated by i.i.d. sampling from an unknown underlying distribution $P(x, y)$.
Introduction
Which one is better? (figure: two candidate separating hyperplanes for the same data)
Theory
The right one is better. Why? A large margin requires small $\|w\|$; small $\|w\|$ implies a small VC dimension of margin hyperplanes ([1], Theorem 5.5); and a small VC dimension yields a smaller upper bound on the true error.
Theory
Large margin requires small $\|w\|$: the distance between the two parallel margin hyperplanes $w \cdot x + b = +1$ and $w \cdot x + b = -1$ is $2 / \|w\|$.
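Why the distance is $2/\|w\|$: a short derivation from the two hyperplane equations above (a standard step, included here for completeness). For a point $x_+$ on the first hyperplane and $x_-$ on the second,
\[
w \cdot x_+ + b = +1, \qquad w \cdot x_- + b = -1 \quad\Longrightarrow\quad w \cdot (x_+ - x_-) = 2,
\]
so the separation measured along the unit normal $w/\|w\|$ is
\[
d = \frac{w}{\|w\|} \cdot (x_+ - x_-) = \frac{2}{\|w\|}.
\]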
Theory
Theory
Small VC dimension leads to a lower true-error bound. With probability at least $1 - \eta$, the following inequality holds ([2]):
\[
R(f) \;\le\; R_{\mathrm{emp}}(f) + \sqrt{\frac{h\left(\ln(2m/h) + 1\right) - \ln(\eta/4)}{m}},
\]
where $m$ is the size of the training set and $h$ is the VC dimension.
Derive a VC Bound
Given a fixed function $f$, for each example the loss $\xi_i = \frac{1}{2}|f(x_i) - y_i|$ is either 0 or 1. All examples are drawn independently, so the $\xi_i$ are independent samples of a random variable, and the Chernoff bound applies:
\[
P\!\left\{\left|\tfrac{1}{m}\textstyle\sum_{i=1}^{m} \xi_i - E[\xi]\right| \ge \epsilon\right\} \;\le\; 2\exp(-2m\epsilon^2).
\]
Derive a VC Bound
For all $f$ in $F$ simultaneously (union bound over a finite class $F$):
\[
P\!\left\{\sup_{f \in F}\left|R_{\mathrm{emp}}(f) - R(f)\right| \ge \epsilon\right\} \;\le\; 2\,|F|\exp(-2m\epsilon^2).
\]
Set $\eta = 2|F|\exp(-2m\epsilon^2)$ and solve for $\epsilon$: with probability at least $1 - \eta$,
\[
R(f) \;\le\; R_{\mathrm{emp}}(f) + \sqrt{\frac{\ln|F| + \ln(2/\eta)}{2m}}.
\]
Derive a VC Bound
The cardinality of $F$: the number of functions from $F$ that can be distinguished by their values on $\{x_1, x_2, \dots, x_{2m}\}$, i.e., the number of different output vectors $(y_1, y_2, \dots, y_{2m})$ that the functions in $F$ can realize on a sample of a given size.
Derive a VC Bound
When $F$ has finite VC dimension $h$, this number of realizable outputs grows only polynomially in the sample size: at most $\sum_{i=0}^{h}\binom{2m}{i} \le (2em/h)^h$ for $2m \ge h$. Substituting this for $|F|$ in the bound above produces a capacity term of the form $h(\ln(2m/h) + 1)$.
VC Dimension
The VC dimension is a property of a set of functions $\{f(\alpha)\}$. If a given set of $\ell$ points can be labeled in all $2^{\ell}$ possible ways, and for each labeling a member of $\{f(\alpha)\}$ can be found which correctly assigns those labels, we say that the set of points is shattered by that set of functions. The VC dimension of $\{f(\alpha)\}$ is defined as the maximum number of training points that can be shattered by $\{f(\alpha)\}$.
VC Dimension
Example: oriented lines in $\mathbb{R}^2$ can shatter three points in general position but no set of four, so their VC dimension is 3 ([2]).
Derive a VC Bound
The capacity term is a property of the function class $F$, so the bound cannot be minimized over the choice of $f$. Instead, we introduce a structure on $F$ (a nested sequence of function classes) and minimize the bound over the choice of structure element: structural risk minimization.
SVM Primal and Dual Problem
Linear Support Vector Machines: the separable case; the non-separable case
Nonlinear Support Vector Machines
The Separable Case
Primal problem:
\[
\min_{w,\,b} \;\; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \ge 1, \quad i = 1, \dots, m.
\]
Dual Problem: The Separable Case
\[
\max_{\alpha} \;\; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{m} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)
\quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_{i=1}^{m} \alpha_i y_i = 0.
\]
Decision function:
\[
f(x) = \operatorname{sgn}\!\left(\sum_{i=1}^{m} \alpha_i y_i\, (x_i \cdot x) + b\right).
\]
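A minimal numerical sketch (scikit-learn is an assumption here; the slides name no software). SVC with a linear kernel solves this dual and exposes the solution via support_vectors_ and dual_coef_ (the latter stores $\alpha_i y_i$); a very large C approximates the hard-margin, separable formulation:

    import numpy as np
    from sklearn.svm import SVC

    # Toy linearly separable data: two well-separated clusters in R^2.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(20, 2) - 3, rng.randn(20, 2) + 3])
    y = np.hstack([-np.ones(20), np.ones(20)])

    # Large C approximates the hard-margin (separable) case.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    # w = sum_i alpha_i y_i x_i, recovered from the dual solution.
    w = (clf.dual_coef_ @ clf.support_vectors_).ravel()
    b = clf.intercept_[0]
    print("number of support vectors:", len(clf.support_vectors_))

    # The decision function sgn(w . x + b) agrees with the classifier.
    assert np.array_equal(np.sign(X @ w + b), clf.predict(X))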
The Non-Separable Case
The Non-Separable Case
The standard approach is to allow the fat decision margin to make a few mistakes: some points (outliers or noisy examples) may lie inside or on the wrong side of the margin. We then pay a cost for each misclassified example, which depends on how far it is from meeting the margin requirement. To implement this, we introduce slack variables $\xi_i \ge 0$.
The Non-Separable Case
We want an algorithm that can tolerate a certain fraction of outliers. Introduce slack variables $\xi_i \ge 0$, use the relaxed constraints $y_i\,(w \cdot x_i + b) \ge 1 - \xi_i$, and penalize the slack in the objective function:
\[
\min_{w,\,b,\,\xi} \;\; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \xi_i .
\]
Nonlinear Support Vector Machines
Map the data into a feature space via $\Phi : \mathcal{X} \to \mathcal{H}$ and run the linear SVM there. The dual problem depends on the data only through inner products $x_i \cdot x_j$, so these can be replaced by $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$ without ever computing $\Phi$ explicitly (the kernel trick).
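A quick numerical check of this identity, a sketch using the homogeneous degree-2 polynomial kernel $K(x, z) = (x \cdot z)^2$ on $\mathbb{R}^2$, whose feature map is known in closed form:

    import numpy as np

    def phi(x):
        # Explicit feature map for K(x, z) = (x . z)^2 on R^2.
        return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])

    k_direct = (x @ z) ** 2        # kernel evaluated in input space
    k_feature = phi(x) @ phi(z)    # inner product in feature space
    assert np.isclose(k_direct, k_feature)  # both equal 1.0 for these inputs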
Nonlinear Support Vector Machines
Some common kernels ([2]): polynomial $K(x, x') = (x \cdot x' + 1)^d$; Gaussian (RBF) $K(x, x') = \exp(-\gamma\,\|x - x'\|^2)$; sigmoid $K(x, x') = \tanh(\kappa\, x \cdot x' - \delta)$.
What conditions must $K$ satisfy for it to be a valid kernel function?
Nonlinear Support Vector Machines
Kernel matrix: $K_{ij} = K(x_i, x_j)$. Each element of the matrix is the inner product of a pair of training examples in feature space, and we use the kernel function to compute that inner product.
Nonlinear Support Vector Machines
If the kernel matrix is positive semi-definite for every finite set of examples, we say that $K$ satisfies the finitely positive semi-definite property; such a $K$ is a valid kernel function.
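A sketch of this check for the Gaussian kernel (the helper name rbf_kernel_matrix is illustrative, not from the slides): build the kernel matrix on a random sample and confirm its eigenvalues are non-negative up to numerical error:

    import numpy as np

    def rbf_kernel_matrix(X, gamma=1.0):
        # K_ij = exp(-gamma * ||x_i - x_j||^2)
        sq = np.sum(X**2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2 * (X @ X.T)
        return np.exp(-gamma * d2)

    X = np.random.RandomState(0).randn(50, 3)
    K = rbf_kernel_matrix(X)
    eigenvalues = np.linalg.eigvalsh(K)      # K is symmetric
    print("smallest eigenvalue:", eigenvalues.min())  # ~0 or positive: PSD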
Parameter Selection
Training the SVM finds $w$ and $b$, but the SVM also has hyperparameters that must be chosen beforehand: the soft-margin constant $C$, the width $\gamma$ of the Gaussian kernel, and the degree of the polynomial kernel.
soft margin constant C
soft margin constant C
When $C$ is small, it is cheap to account for some data points with slack variables, and the margin is placed fat so that it models the bulk of the data. As $C$ becomes large, it pays to respect individual data points at the cost of reducing the geometric margin, and the complexity of the function class increases.
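A small experiment consistent with this picture (a sketch; the dataset and C values are illustrative): with overlapping classes, a larger C leaves fewer points inside the margin, so the number of support vectors shrinks while training accuracy rises:

    import numpy as np
    from sklearn.svm import SVC

    # Two overlapping Gaussian clusters, so some slack is unavoidable.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(50, 2) - 1, rng.randn(50, 2) + 1])
    y = np.hstack([-np.ones(50), np.ones(50)])

    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        print(f"C={C:<7} support vectors={clf.n_support_.sum():3d} "
              f"train accuracy={clf.score(X, y):.2f}")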
degree of polynomial kernel
The lowest-degree polynomial is the linear kernel; it is not sufficient when there is a non-linear relationship between the two classes. A degree-2 polynomial is often flexible enough; a degree-5 polynomial gives a decision boundary with greater curvature (and more risk of overfitting).
width of Gaussian kernel
A small $\gamma$ gives a smooth decision boundary; a large $\gamma$ gives a decision boundary with greater curvature. In the example shown, $\gamma = 100$ overfits the data.
A Simple Procedure (from [4])
Scale the data; use the Gaussian (RBF) kernel; use cross-validation to find the best $C$ and $\gamma$; retrain on the whole training set with the best parameters; then test.
Chih-Jen Lin, Support Vector Machines, Machine Learning Summer School 2006.
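A sketch of that procedure with scikit-learn (the library choice, dataset, and grid values are assumptions, not from the slides):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Scale the data, use the RBF kernel, cross-validate over C and gamma.
    pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": [0.01, 0.1, 1, 10]}
    search = GridSearchCV(pipe, grid, cv=5).fit(X_tr, y_tr)

    # GridSearchCV refits on the whole training set with the best parameters.
    print("best parameters:", search.best_params_)
    print("test accuracy:", search.best_estimator_.score(X_te, y_te))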
Compare to Other Classifiers
Decision tree: tends to overfit.
Naïve Bayes classifier: assumes feature independence; suffers from data sparseness.
SVM: the main practical difficulty is kernel selection/design.
Conclusion and Discussion
Intuitive.
Admits linear or non-linear decision boundaries.
Does not make unreasonable assumptions about the data.
Resists overfitting (margin maximization controls capacity).
Has few parameters, so it is easy to model and train.
References
[1] Bernhard Schölkopf, Alexander J. Smola. Learning with Kernels. MIT Press, 2002.
[2] Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2, 121-167 (1998).
[3] Asa Ben-Hur, Jason Weston. A User's Guide to Support Vector Machines.
[4] Chih-Jen Lin. Support Vector Machines. Machine Learning Summer School, 2006.
Thanks for your attention!