Machine Learning 2010, Michael M Richter
Support Vector Machines
Email: mrichter@ucalgary.ca
Topic
This chapter deals with concept learning the numerical way: all concepts, problems, and decisions are numerically coded, so one deals with sets of numbers. Support vector machines are a special technique for this.
Classification in R^n
Situation: objects are coded as points in R^n; the learning domain X is a subset of R^n. There are two classes, denoted by the values {+1, -1}.
Examples are therefore of the form (x_i, b_i) with x_i = (x_i1, ..., x_in) ∈ X and b_i ∈ {+1, -1}.
Task: generate a hypothesis h describing a classifier.
First approach: assume positive and negative examples are linearly separable, i.e. can be separated by a linear function. Then h has to be a hyperplane in X that separates the positive from the negative examples.
Example (1) (n = 2)
The linear function h separates the examples.
Description of h: ⟨w,z⟩ + d = 0, i.e. h = { z | ⟨w,z⟩ + d = 0 }; the plane lies at distance -d/‖w‖ from the origin along w.
Scalar product (inner product): ⟨w,z⟩ = w_1 z_1 + w_2 z_2 (= ‖w‖·‖z‖·cos(angle(w,z))).
Example (2) (n = 2)
h = { z | ⟨w,z⟩ + d = 0 }. Suppose x is a point not yet classified. Then classify x as
+1, if ⟨w,x⟩ + d ≥ 0
-1, if ⟨w,x⟩ + d < 0.
The classification with label b ∈ {+1, -1} is correct for x iff b(⟨w,x⟩ + d) ≥ 0.
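The decision rule above can be sketched in a few lines of code. This is a minimal illustration, not part of the original slides; the helper names `classify` and `is_correct` are chosen here for clarity.

```python
def classify(w, d, x):
    """Classify x as +1 or -1 depending on which side of the
    hyperplane <w, x> + d = 0 it lies on."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + d
    return 1 if s >= 0 else -1

def is_correct(w, d, x, b):
    """The label b is correct for x iff b * (<w, x> + d) >= 0."""
    return b * (sum(wi * xi for wi, xi in zip(w, x)) + d) >= 0
```

For example, with w = (1, 1) and d = -1 the point (2, 2) lies on the positive side, the origin on the negative side.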
Goal
Find a decision plane that separates sets of objects with different class memberships. Minimize the empirical classification error and maximize the geometric margin at the same time.
Hyperplanes (1)
There are several classifying hyperplanes. Which one is the best?
Criterion 1: robustness when new examples are inserted.
Criterion 2: quality of prediction (e.g. minimal cost of prediction errors).
Hyperplanes (2)
Criterion 1: robustness if new examples are inserted. Criterion 2: quality of prediction.
Maximum-Margin Hyperplane: the plane for which the minimal distance to positive and negative examples is maximal. The margin (boundary) has breadth d = 2q.
Properties of the Maximum-Margin Hyperplane
Minimal probability that h changes if new examples are inserted; maximal expected robustness of the prediction; maximal expected quality of prediction.
Consequence: maximal distance q = min_i b_i(⟨w,x_i⟩ + d) / ‖w‖ of the nearest positive and negative examples.
Learning task: construct the Maximum-Margin Hyperplane.
Construction of the Maximum-Margin Hyperplane
Observe: the Maximum-Margin Hyperplane depends only on those positive and negative examples with minimal distance! The corresponding vectors are called support vectors; methods for determining the MMH are called Support Vector Machines.
Which Points Determine the Margin?
For b_i = +1 we have ⟨w,x_i⟩ + d ≥ 0; for b_i = -1 we have ⟨w,x_i⟩ + d < 0.
Hence b_i(⟨w,x_i⟩ + d) ≥ 0 for all i (with h), and b_i(⟨w,x_i⟩ + d) / ‖w‖ ≥ q for all i.
For b_i = +1: ⟨w,x_i⟩ + d ≥ γ (with h_1); for b_i = -1: ⟨w,x_i⟩ + d ≤ -γ (with h_2).
So b_i(⟨w,x_i⟩ + d) ≥ γ for all i (with h_1 and h_2), and q = γ / ‖w‖ is to be made maximal.
Search w and d with Maximal Margin
Look for w and d such that q = min_i b_i(⟨w,x_i⟩ + d) / ‖w‖ is maximal, resp. γ = min_i b_i(⟨w,x_i⟩ + d) is maximal.
Idea: for the hyperplane h the direction of w is fixed; d and ‖w‖ are variable.
Normal form: choose w such that γ = 1; then q = 1/‖w‖ and the margin has a width of 2/‖w‖.
2/‖w‖ should be maximal with b_i(⟨w,x_i⟩ + d) ≥ 1 for all i
⇔ ½‖w‖ should be minimal with b_i(⟨w,x_i⟩ + d) ≥ 1 for all i
⇔ ½⟨w,w⟩ should be minimal with b_i(⟨w,x_i⟩ + d) ≥ 1 for all i.
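The normal form can be checked mechanically: the margin width is 2/‖w‖, and a candidate (w, d) is admissible iff every training example satisfies b_i(⟨w,x_i⟩ + d) ≥ 1. A minimal sketch (helper names are illustrative, not from the slides):

```python
import math

def margin_width(w):
    """Under the normal form gamma = 1, the margin has width 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def satisfies_constraints(w, d, examples):
    """Check b_i * (<w, x_i> + d) >= 1 for all (x_i, b_i)."""
    return all(
        b * (sum(wi * xi for wi, xi in zip(w, x)) + d) >= 1.0
        for x, b in examples
    )
```

E.g. for examples (2,0) labeled +1 and (-2,0) labeled -1, the plane w = (1,0), d = 0 is admissible with margin width 2; shrinking w would widen the margin but violate the constraints.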
Learning = Optimizing: The Optimization Problem
Determine w and d such that ½⟨w,w⟩ is minimal (goal) and b_i(⟨w,x_i⟩ + d) - 1 ≥ 0 for all i from {1,...,m} (condition).
How to solve such an optimization problem in general?
Given: f, c_1, ..., c_m : R^n → R. Wanted: w in R^n such that f(w) is minimal/maximal (goal) and c_j(w) = 0 for all j from {1,...,m} (condition).
General approach: Lagrange multipliers!
L(w) = f(w) - λ_1 c_1(w) - ... - λ_m c_m(w) with variables λ_j (Lagrange multipliers): n+m unknowns w_1,...,w_n, λ_1,...,λ_m and n+m equations for the extrema: ∂L/∂w_i = 0 and c_j(w) = 0.
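For intuition, the optimization problem can be solved by brute force on a tiny example. The sketch below (my own illustration, not the slides' method; a grid search, not Lagrange multipliers) finds the maximum-margin hyperplane for two 1-D points x = 2 (label +1) and x = -2 (label -1); the optimum is w = 1/2, d = 0, i.e. margin width 2/w = 4, exactly the gap between the points.

```python
# Minimize 1/2 * w^2 subject to b_i * (w * x_i + d) >= 1,
# by scanning a coarse grid over (w, d). Toy 1-D training set:
examples = [(2.0, +1), (-2.0, -1)]  # (x_i, b_i)

def feasible(w, d):
    return all(b * (w * x + d) >= 1.0 for x, b in examples)

best = None
for wi in range(1, 301):            # w = 0.01 .. 3.00
    w = wi / 100.0
    for di in range(-100, 101):     # d = -1.00 .. 1.00
        d = di / 100.0
        if feasible(w, d) and (best is None or w * w / 2 < best[0]):
            best = (w * w / 2, w, d)

obj, w, d = best                    # expected optimum: w = 0.5, d = 0.0
```

In higher dimensions a grid search is hopeless, which is exactly why the Lagrangian/QP machinery of the following slides is needed.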
Noise, Not Representative Data
Basic idea: soft margin instead of (hard) margin. See the chapter on PAC learning.
Weakly Separating Hyperplane
Choose c ∈ R, c > 0, and minimize ‖w‖² + c·Σ_{i=1}^n ξ_i such that for all i:
f(x_i) = ⟨w,x_i⟩ + b ≥ 1 - ξ_i for y_i = +1, and f(x_i) = ⟨w,x_i⟩ + b ≤ -1 + ξ_i for y_i = -1.
Equivalent: y_i·f(x_i) ≥ 1 - ξ_i.
Meaning of ξ
ξ_i = 0: x_i is correctly classified and lies outside the margin (beyond f(x) = +1 resp. f(x) = -1).
0 < ξ_i < 1: x_i is still correctly classified but lies inside the margin.
ξ_i > 1: x_i is misclassified (on the wrong side of f(x) = 0).
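The slack variables have the closed form ξ_i = max(0, 1 - y_i·f(x_i)), which makes the soft-margin objective easy to evaluate. A minimal sketch (function names are my own):

```python
def slack(w, b, x, y):
    """xi = max(0, 1 - y * f(x)) with f(x) = <w, x> + b."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * f)

def soft_margin_objective(w, b, data, c):
    """||w||^2 + c * sum of slacks, as on the previous slide."""
    return (sum(wi * wi for wi in w)
            + c * sum(slack(w, b, x, y) for x, y in data))
```

E.g. with w = (1, 0), b = 0: a point at x_1 = 2 with y = +1 has ξ = 0 (outside the margin), x_1 = 0.5 gives ξ = 0.5 (inside the margin), and x_1 = -1 gives ξ = 2 (misclassified).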
Not Linearly Separable Data
In applications, linearly separable data are rare. One approach: remove a minimal set of points such that the remaining data are linearly separable (i.e. minimal classification error). Problem: this algorithm is exponential.
More Complex Examples (1)
Here we consider examples where the nonlinearity is not a consequence of noise but results from the nature of the problem.
More Complex Examples (2)
Idea: transform the domain X into another space X' such that in X' the positive and negative training examples are linearly separable: φ: X → X' = φ(X).
Remark: the dimensions of X and X' may be different!
More Complex Examples (3)
In X (n = 2) the classifier is non-linear (an ellipsoid); in X' = φ(X) (n = 3, coordinates z_1, z_2, z_3) the classifier is a linear hyperplane.
φ(x) = φ(x_1, x_2) = (x_1², √2·x_1·x_2, x_2²)
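The √2 factor in the middle coordinate is what makes the inner product in X' collapse to a simple formula in X: ⟨φ(x), φ(y)⟩ = ⟨x,y⟩². A quick check of this identity (an illustration of the slide's map, not from the slides themselves):

```python
import math

def phi(x):
    """The slide's feature map: phi(x1, x2) = (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))
```

For x = (1, 2) and y = (3, 4): ⟨φ(x), φ(y)⟩ = 9 + 48 + 64 = 121 = (⟨x,y⟩)² = 11². This identity is the point of departure for kernels on the next slides.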
More General: Kernels
Kernel function = inner product in some space (which may be very complex). Kernel methods exploit the properties of an inner product space. Kernels occur in many machine learning methods, not only in this chapter.
General Kernel Functions Instead of the Scalar Product
Examples:
Polynomials, homogeneous: K(x,y) = (⟨x,y⟩)^d
Polynomials, inhomogeneous: K(x,y) = (⟨x,y⟩ + 1)^d
Radial basis function: K(x,y) = exp(-g·‖x - y‖²) for g > 0
They describe situations of a non-linearly separable character.
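The three kernel families above are trivial to implement directly; a sketch (parameter names `d` for the degree and `g` for the RBF width follow the slide):

```python
import math

def poly_kernel(x, y, d, homogeneous=True):
    """K(x,y) = (<x,y>)^d, or (<x,y> + 1)^d in the inhomogeneous case."""
    s = sum(a * b for a, b in zip(x, y))
    return s ** d if homogeneous else (s + 1.0) ** d

def rbf_kernel(x, y, g):
    """K(x,y) = exp(-g * ||x - y||^2), g > 0."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-g * sq)
```

Note that the homogeneous polynomial kernel with d = 2 reproduces exactly the inner product of the explicit feature map from the previous slide, without ever constructing φ.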
The Kernel Trick (1)
The kernel trick is a method for using a linear classifier algorithm to solve a non-linear problem: map the original non-linear observations into a higher-dimensional space, where the linear classifier is subsequently used. This makes a linear classification in the new space equivalent to a non-linear classification in the original space.
The Kernel Trick (2)
This is done using Mercer's theorem: any continuous, symmetric, positive semi-definite function K(x, y) can be expressed as a scalar product in a high-dimensional space. That is, there exists a function φ such that K(x, y) = ⟨φ(x), φ(y)⟩.
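To see the trick in action without the full SVM machinery, here is a kernel perceptron (my own illustration; a perceptron, not the SVM optimizer): a linear mistake-driven learner run implicitly in the feature space of an RBF kernel. It separates an XOR-style dataset that no linear classifier in the original space can handle.

```python
import math

def rbf(x, y, g=1.0):
    return math.exp(-g * sum((a - b) ** 2 for a, b in zip(x, y)))

def kernel_perceptron(data, kernel, epochs=20):
    """Perceptron in the kernel's feature space: the weight vector is
    never formed explicitly, only kernel evaluations are used."""
    alpha = [0] * len(data)
    for _ in range(epochs):
        for j, (x, y) in enumerate(data):
            s = sum(a * yi * kernel(xi, x)
                    for a, (xi, yi) in zip(alpha, data))
            if y * s <= 0:          # mistake: strengthen this example
                alpha[j] += 1
    return alpha

def predict(alpha, data, kernel, x):
    s = sum(a * yi * kernel(xi, x) for a, (xi, yi) in zip(alpha, data))
    return 1 if s >= 0 else -1
```

On the four XOR points, the learner fits the training set after a few passes; an SVM with the same kernel would additionally maximize the margin among such solutions.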
Example
(Figure: a non-linearly separable set in X mapped by φ into X' = φ(X), where it becomes linearly separable.)
Simplicity
Aside: the principle of structural risk minimization.
Too simple: errors. Right simplicity. Correct, but not simple enough.
Complexity Problem (1)
Training a support vector machine (SVM) requires solving a quadratic programming (QP) problem in a number of coefficients equal to the number of training examples. For very large datasets, standard numeric techniques for QP become infeasible. Practical techniques decompose the problem into manageable subproblems over parts of the data. A disadvantage of this technique is that it may give an approximate solution, and it may require many passes through the dataset to reach a reasonable level of convergence.
Complexity Problem (2)
An on-line alternative is training an SVM incrementally. However, adding new data while discarding all previous data except their support vectors gives only approximate results. A better way is incremental learning as an exact on-line method that constructs the solution recursively, one point at a time. The key is to keep the optimization conditions satisfied on all previously seen data while adiabatically adding a new data point to the solution.
Complexity Problem (3)
In adiabatic increments the margin vector coefficients change value during each incremental step to keep all elements in equilibrium, i.e. to keep the optimization conditions satisfied. The examples are added one by one; at each step, the valid margins are updated by expressing the new solution in terms of the old solution plus a new term. A MATLAB package implements the methods for exact incremental/decremental SVM learning, regularization parameter perturbation, and kernel parameter perturbation presented in "SVM Incremental Learning, Adaptation and Optimization" by Christopher Diehl and Gert Cauwenberghs.
Applications (1)
There are many, and an increasing number of, applications. We mention some typical ones:
Face recognition.
Text classification.
Generalized predictive control: controlling chaotic dynamics with small parameter perturbations.
Statistical learning theory for geo(spatial) and spatio-temporal environmental data analysis and modelling, with comparisons to geostatistical predictions and simulations.
Personalized, learner-centered e-learning is receiving increasing attention; here SVMs stand out for their performance on the high-dimensional representations that text content typically has.
Applications (2)
NewsRec is an SVM-driven personal recommender system designed for news websites; it uses SVMs to predict whether articles are interesting or not.
Bioinformatics applications: coding sequences in DNA encode proteins. Protein remote homology detection is a central problem in computational biology, and supervised learning algorithms based on support vector machines are currently among the most effective methods for remote homology detection.
Such applications are typical of the kind of problem where SVMs do well.
Typical Applications
SVMs in OCR (optical character recognition): databases with handwritten digits.
SVMs in image recognition.
Tools
Libsvm toolbox: the function svmtrain can be used to train the classifier, and svmpredict to classify the test data. svmtrain has an option kernel_type with four values: linear, polynomial, radial basis, and sigmoid; the radial basis kernel type is often recommended.
DTREG generates SVM, decision tree, and logistic regression models: http://www.dtreg.com
Summary (1): Main Elements
Chosen parameters (kernel, ...); form of admissible hypotheses; efficiency requirements for learning; relevant aspects; quality criteria; efficiency requirements for classifiers; interpretability of classifiers; correctness of data.
Summary (2)
Geometric interpretation; hyperplanes and hypersurfaces; kernels; best separation; non-linear separability; applications and tools.
Recommended Literature
T. Mitchell: Machine Learning. McGraw Hill, 1997.
B. Schölkopf: Support Vector Learning. Oldenbourg, 1997.
http://www.kernel-machines.org
I. Bratko, I. Kononenko: Learning Diagnostic Rules from Incomplete and Noisy Data. AI Methods in Statistics, 16-17 Dec. 1986, London. In: B. Phelps (ed.), Interactions in Artificial Intelligence and Statistical Methods, Technical Press, 1987.
Serdar Iplikci: Support Vector Machines-Based Generalized Predictive Control. International Journal of Robust and Nonlinear Control, Vol. 16, pp. 843-862, 2006.
M. Kanevski, N. Gilardi, E. Mayoraz, M. Maignan: Spatial Data Classification with Support Vector Machines. Geostat 2000 Congress, South Africa, April 2000.
Gert Cauwenberghs, Tomaso Poggio: Incremental and Decremental Support Vector Machine Learning. In: Advances in Neural Information Processing Systems, Volume 13, 2001.