Kernel Methods and Support Vector Machines

Intelligent Systems: Reasoning and Recognition
James L. Crowley
ENSIMAG 2 / MoSIG M1, Second Semester 2012/2013
Lesson 20, 2 May 2013

Kernel Methods and Support Vector Machines

Contents
Kernel Functions
  Quadratic as a Kernel Function
  Gaussian Kernel
  Kernel Functions for Symbolic Data
  Kernels for Bayesian Reasoning
Support Vector Machines
  Hard-Margin SVMs: Separable Training Data
  Soft-Margin SVMs: Non-Separable Training Data

Bibliographic sources:
Neural Networks for Pattern Recognition, C. M. Bishop, Oxford Univ. Press, 1995.
A Computational Biology Example using Support Vector Machines, Suzy Fei, 2009 (online).

Kernel Functions

Linear discriminant functions can be very efficient classifiers, provided that the class features can be separated by a linear decision surface. The linear discriminant function is

$g(X) = W^T X + b$

The decision rule is: IF $W^T X + b > 0$ THEN $X \in C_1$ ELSE $X \in C_2$.

For many domains it is easier to separate the classes with a linear function if you first transform your feature data into a space with a higher number of dimensions. One way to do this is to transform the features with a kernel function. Instead of the decision surface $g(X) = W^T X + b$, we use a decision surface

$g(X) = W^T \phi(X) + b$

This can be used to construct non-linear decision surfaces for our data. The trick is to learn with the non-linear mapping, but to use the resulting discriminant without actually computing the mapping.

Formally, a kernel function is any function that satisfies the Mercer condition. Mercer's condition requires that for any finite set of observations $\{X_1, \ldots, X_M\} \subseteq S$ and any choice of coefficients $c_i$, the kernel is positive semi-definite:

$\sum_{i=1}^{M} \sum_{j=1}^{M} c_i c_j \, k(X_i, X_j) \ge 0$
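
As an aside (a minimal numerical sketch, not part of the original notes): the Mercer condition can be checked empirically on a finite sample by testing whether the Gram matrix of a candidate kernel is positive semi-definite. The kernels and sample data below are illustrative assumptions.

import numpy as np

def gram_matrix(kernel, X):
    """Gram matrix K[i, j] = kernel(X[i], X[j]) for a finite set of observations."""
    M = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(M)] for i in range(M)])

def satisfies_mercer_empirically(kernel, X, tol=1e-10):
    """sum_i sum_j c_i c_j k(X_i, X_j) >= 0 for all c  <=>  all eigenvalues of K >= 0."""
    K = gram_matrix(kernel, X)
    eigenvalues = np.linalg.eigvalsh(K)     # K is symmetric for a symmetric kernel
    return bool(np.all(eigenvalues >= -tol))

# Illustrative sample and kernels
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
linear = lambda x, z: x @ z                                       # inner product
gaussian = lambda x, z: np.exp(-np.sum((x - z)**2) / (2 * 1.0**2))
print(satisfies_mercer_empirically(linear, X))     # True
print(satisfies_mercer_empirically(gaussian, X))   # True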

This condition is satisfied by inner products. Inner products are of the form

$W^T X = \langle W, X \rangle = \sum_{d=1}^{D} w_d x_d$

Thus $k(x, z) = \langle \phi(x), \phi(z) \rangle$ is a valid kernel function. The Mercer condition can also be satisfied by other functions. For example, the Gaussian function

$k(x, z) = e^{-\frac{\|x - z\|^2}{2\sigma^2}}$

is a valid kernel.

We can learn the discriminant in an inner product space, $W^T \phi(X) = \langle W, \phi(X) \rangle$, where the vector $W$ is learned from the mapped values of our training data. This gives a discriminant function of the form:

$g(X) = \sum_{m=1}^{M} a_m y_m \langle \phi(X_m), \phi(X) \rangle + b$

We will see that we can learn in the kernel space, and then recognize, without actually having to compute the mapping $\phi(X)$.

Kernels can be extended to infinite-dimensional spaces and even to non-numerical, symbolic data.

Quadratic as a Kernel Function

A common kernel is the quadratic kernel

$k(x, z) = (z^T x + c)^2 = \phi(z)^T \phi(x)$

For D = 2, this gives

$\phi(X) = \left( x_1^2,\ x_2^2,\ \sqrt{2}\, x_1 x_2,\ \sqrt{2c}\, x_1,\ \sqrt{2c}\, x_2 \right)^T$

This kernel maps a 2-D feature space to a 5-D kernel space (a sixth, constant component $c$ completes the expansion of $(z^T x + c)^2$ exactly). This can be used to map a plane to a hyperbolic surface.
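
The "learn with the mapping, recognize without computing it" property can be verified numerically. The sketch below is my own illustration; it includes the constant sixth component $c$ so that the identity $k(x, z) = \phi(z)^T \phi(x)$ holds exactly.

import numpy as np

def phi_quadratic(x, c=1.0):
    """Explicit feature map for the quadratic kernel k(x, z) = (z.T x + c)**2, D = 2."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2 * c) * x1,
                     np.sqrt(2 * c) * x2,
                     c])                       # constant component completes the expansion

def k_quadratic(x, z, c=1.0):
    """Quadratic kernel computed directly, without the mapping."""
    return (z @ x + c) ** 2

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(k_quadratic(x, z))                       # direct evaluation: 0.25
print(phi_quadratic(x) @ phi_quadratic(z))     # same value via the explicit mapping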

Gaussian Kernel

The Gaussian exponential is very often used as a kernel function. In this case:

$k(x, z) = e^{-\frac{\|x - z\|^2}{2\sigma^2}}$

This satisfies Mercer's condition because the exponent is separable:

$\|x - z\|^2 = x^T x - 2\, x^T z + z^T z$

Intuitively, you can see this as placing a Gaussian function, multiplied by the indicator variable ($y_m$ = +/- 1), at each training sample, and then summing the functions. The zero-crossings in the sum of Gaussians define the decision surface.

Depending on $\sigma$, this can provide a good fit or an overfit to the data. If $\sigma$ is large compared to the distance between the classes, this can give an overly flat discriminant surface. If $\sigma$ is small compared to the distance between classes, this will overfit the samples. A good choice for $\sigma$ will be comparable to the distance between the closest members of the two classes.

(Figure from the lecture "A Computational Biology Example using Support Vector Machines", Suzy Fei, 2009, online.)

Among other properties, the feature vector can have an infinite number of dimensions.
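
A small sketch (my own, with illustrative data) of the Gaussian kernel and of the heuristic above, taking $\sigma$ comparable to the distance between the closest members of the two classes:

import numpy as np

def gaussian_kernel(x, z, sigma):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))"""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigma_heuristic(X, y):
    """Heuristic from the text: sigma comparable to the smallest distance
    between members of the two classes (labels y in {-1, +1})."""
    X_pos, X_neg = X[y == +1], X[y == -1]
    dists = np.linalg.norm(X_pos[:, None, :] - X_neg[None, :, :], axis=2)
    return dists.min()

# Illustrative two-class data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, size=(10, 2)), rng.normal(+2, 1, size=(10, 2))])
y = np.array([-1] * 10 + [+1] * 10)
sigma = sigma_heuristic(X, y)
print(sigma, gaussian_kernel(X[0], X[10], sigma))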

Kernel Functions for Symbolic Data

Kernel functions can be defined over graphs, sets, strings and text. Consider, for example, a non-vector space composed of a set of words S, and two subsets of S: $A \subseteq S$ and $B \subseteq S$. We can define a kernel function of A and B using the intersection operation:

$k(A, B) = 2^{|A \cap B|}$

where $|\cdot|$ denotes the cardinality (the number of elements) of a set.

Kernels for Bayesian Reasoning

We can define a kernel for Bayesian reasoning, for evidence accumulation. Given a probabilistic model $p(X)$ we can define a kernel as:

$k(X, Z) = p(X)\, p(Z)$

This is clearly a valid kernel because it is a 1-D inner product. Intuitively, it says that two feature vectors, X and Z, are similar if they both have high probability. We can extend this to conditional probabilities:

$k(X, Z) = \sum_{n=1}^{N} p(X \mid A_n)\, p(Z \mid A_n)\, p(A_n)$

Two vectors X, Z will give large values for the kernel, and hence be seen as similar, if they have significant probability for the same components $A_n$.
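
A minimal sketch of the set-intersection kernel, with illustrative word sets:

def intersection_kernel(A, B):
    """k(A, B) = 2**|A intersect B|, defined over subsets of a set of words S."""
    return 2 ** len(A & B)

A = {"kernel", "margin", "support", "vector"}
B = {"margin", "vector", "gaussian"}
print(intersection_kernel(A, B))   # |A & B| = 2, so 2**2 = 4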

Support Vector Machines

A significant limitation of linear learning methods is that the kernel function must be evaluated for every training point during learning. An alternative is to use a learning algorithm with sparse support, that is, one that uses only a small number of points to learn the separation boundary. A Support Vector Machine (SVM) is such an algorithm.

SVMs are popular for problems of classification, regression and novelty detection. Solving for the model parameters corresponds to a convex optimisation problem: any local solution is a global solution.

We will use the two-class problem, K = 2, to illustrate the principle. Multi-class solutions are possible.

Our linear model for the decision surface is

$g(X) = W^T \phi(X) + b$

where $\phi(X)$ is a feature space transformation that maps a hyper-plane in F dimensions into a non-linear decision surface in D dimensions.

The training data is a set of M training samples $\{X_m\}$ and their indicator variables $\{y_m\}$. As in our last lecture, for a 2-class problem, $y_m$ is -1 or +1. A new, observed point (not in the training data) will be classified using the function $\mathrm{sign}(g(X))$, so that the classification of a training sample is correct if $y_m\, g(X_m) > 0$.
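
As a concrete, hedged illustration of the decision rule $\mathrm{sign}(g(X))$: the sketch below uses scikit-learn's SVC, a library implementation of the support vector machine developed in the rest of this lesson. The data, the kernel width and the regularization parameter C (introduced in the soft-margin section later) are assumptions chosen only for the example.

import numpy as np
from sklearn.svm import SVC

# Two-class toy data with labels y_m in {-1, +1}
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(30, 2)), rng.normal(+2, 1, size=(30, 2))])
y = np.array([-1] * 30 + [+1] * 30)

sigma = 1.0
clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma**2), C=10.0)   # Gaussian kernel
clf.fit(X, y)

X_new = np.array([[0.5, 1.5]])
g = clf.decision_function(X_new)         # g(X) = sum_m a_m y_m k(X, X_m) + b
print(np.sign(g), clf.predict(X_new))    # classification: sign(g(X))
print("number of support vectors:", len(clf.support_))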

Hard-Margin SVMs: Separable Training Data

We assume that the two classes can be separated by a linear function. That is, there exists a hyper-plane $g(X) = w^T \phi(X) + b$ such that $y_m\, g(X_m) > 0$ for all m. Generally there will exist many such solutions for separable data.

The margin, $\gamma$, is the minimum distance of any sample from the hyper-plane. For a Support Vector Machine, we will determine the decision surface that maximizes the margin, $\gamma$. What we are going to do is design the decision boundary so that it has an equal distance from a small number of support points.

The distance of a point from the hyper-plane is $\frac{g(X)}{\|w\|}$, and $y_m\, g(X_m) > 0$ for all training points, so the distance of the point $X_m$ to the decision surface is:

$\frac{y_m\, g(X_m)}{\|w\|} = \frac{y_m\, (w^T \phi(X_m) + b)}{\|w\|}$

For a decision surface $(w, b)$, the support vectors are the samples from the training set $\{X_m\}$ that attain this minimum distance, the margin:

$\gamma = \min_m \left\{ \frac{1}{\|w\|}\, y_m\, (w^T \phi(X_m) + b) \right\}$
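
As a numerical illustration (my own sketch, with an assumed hyperplane and toy data) of the margin formula above; $\phi$ is taken as the identity to keep the example small:

import numpy as np

def margin(w, b, X, y):
    """gamma = min_m  y_m * (w.T X_m + b) / ||w||   (phi = identity here)."""
    distances = y * (X @ w + b) / np.linalg.norm(w)
    return distances.min()

# Linearly separable toy data and an assumed separating hyperplane (w, b)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([+1, +1, -1, -1])
w = np.array([1.0, 1.0])
b = 0.0
print(margin(w, b, X, y))   # minimum signed distance over the training set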

We will seek to maximize the margin by solving:

$\arg\max_{w,b} \left\{ \frac{1}{\|w\|} \min_m \left[ y_m\, (w^T \phi(X_m) + b) \right] \right\}$

The factor $\frac{1}{\|w\|}$ can be taken outside the minimization over m because $\|w\|$ does not depend on m.

A direct solution would be very difficult, but the problem can be converted to an equivalent problem. Note that rescaling $w$ and $b$ changes nothing. Thus we will scale the equation such that, for the sample that is closest to the decision surface (smallest margin):

$y_m\, (w^T \phi(X_m) + b) = 1$, that is, $y_m\, g(X_m) = 1$

For all other sample points:

$y_m\, (w^T \phi(X_m) + b) \ge 1$

This is known as the canonical representation of the decision hyperplane. The training samples for which $y_m\, (w^T \phi(X_m) + b) = 1$ are said to be the active constraints. All other training samples are inactive. By definition there is always at least one active constraint, and when the margin is maximized there will be at least two active constraints.

Thus the optimization problem becomes

$\arg\min_{w,b} \left\{ \frac{1}{2} \|w\|^2 \right\}$

subject to the active constraints. The factor of 1/2 is a convenience for later analysis.

To solve this problem, we use Lagrange multipliers $a_m \ge 0$, with one multiplier for each constraint. This gives the Lagrangian function:

$L(w, b, a) = \frac{1}{2} \|w\|^2 - \sum_{m=1}^{M} a_m \left\{ y_m\, (w^T \phi(X_m) + b) - 1 \right\}$

Setting the derivatives to zero, we obtain:

$\frac{\partial L}{\partial w} = 0 \implies w = \sum_{m=1}^{M} a_m y_m \phi(X_m)$

$\frac{\partial L}{\partial b} = 0 \implies \sum_{m=1}^{M} a_m y_m = 0$

Eliminating $w$ and $b$ from $L(w, b, a)$ we obtain the dual:

$\tilde{L}(a) = \sum_{m=1}^{M} a_m - \frac{1}{2} \sum_{m=1}^{M} \sum_{n=1}^{M} a_m a_n y_m y_n\, k(X_m, X_n)$

with constraints:

$a_m \ge 0$ for $m = 1, \ldots, M$
$\sum_{m=1}^{M} a_m y_m = 0$

where the kernel function is $k(X_1, X_2) = \phi(X_1)^T \phi(X_2)$.

The solution takes the form of a quadratic programming problem. In the primal, this is a problem in D variables (the dimension of the kernel space), which would normally take $O(D^3)$ computations. In going to the dual formulation, we have converted this to a problem over the M data points, requiring $O(M^3)$ computations. This can appear to be a drawback, but the solution only depends on a small number of points.

To classify a new observed point, we evaluate:

$g(X) = \sum_{m=1}^{M} a_m y_m\, k(X, X_m) + b$
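
A sketch (not from the notes) of this decision function in its dual form; it assumes the multipliers $a_m$ and the bias $b$ have already been obtained from a quadratic programming solver, and it evaluates only the kernel, never $\phi$:

import numpy as np

def decision_function(X_new, X_train, y_train, a, b, kernel):
    """g(X) = sum_m a_m y_m k(X, X_m) + b, evaluated from the dual variables."""
    return sum(a_m * y_m * kernel(X_new, X_m)
               for a_m, y_m, X_m in zip(a, y_train, X_train)) + b

def classify(X_new, X_train, y_train, a, b, kernel):
    """Decision rule: sign(g(X))."""
    return np.sign(decision_function(X_new, X_train, y_train, a, b, kernel))

In practice only the training points with $a_m \ne 0$ (the support vectors, introduced next) need to be retained in these sums.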

The solution to optimization problems of this form satisfies the Karush-Kuhn-Tucker (KKT) conditions, requiring:

$a_m \ge 0$
$y_m\, g(X_m) - 1 \ge 0$
$a_m \left\{ y_m\, g(X_m) - 1 \right\} = 0$

For every data point in the training samples $\{X_m\}$, either $a_m = 0$ or $y_m\, g(X_m) = 1$. Any point for which $a_m = 0$ does not contribute to

$g(X) = \sum_{m=1}^{M} a_m y_m\, k(X, X_m) + b$

and thus is not used (is not active). The remaining points, for which $a_m > 0$, are called the support vectors. These points lie on the margin, at $y_m\, g(X_m) = 1$, of the maximum margin hyperplane. Once the model is trained, all other points can be discarded.

Let us define the set of support vectors as S. Now that we have solved for S and the $a_m$, we can solve for b. We note that for any support vector:

$y_m \left( \sum_{n \in S} a_n y_n\, k(X_m, X_n) + b \right) = 1$

Averaging over all $N_S$ support vectors in S gives:

$b = \frac{1}{N_S} \sum_{m \in S} \left( y_m - \sum_{n \in S} a_n y_n\, k(X_m, X_n) \right)$

The maximum margin criterion can also be expressed as minimization of an error function $E_\infty(z)$ that is zero if $z \ge 0$ and $\infty$ otherwise.

(From Bishop, p. 331.)
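
Following the averaging formula above, a sketch (my own) of recovering b once the multipliers $a_m$ are known; the support vector set S is taken as the indices with $a_m > 0$, up to a small numerical tolerance:

import numpy as np

def solve_bias(X_train, y_train, a, kernel, tol=1e-8):
    """b = (1/N_S) * sum_{m in S} ( y_m - sum_{n in S} a_n y_n k(X_m, X_n) ),
    where S = { m : a_m > 0 } is the set of support vectors."""
    S = [m for m in range(len(a)) if a[m] > tol]
    b_terms = []
    for m in S:
        s = sum(a[n] * y_train[n] * kernel(X_train[m], X_train[n]) for n in S)
        b_terms.append(y_train[m] - s)
    return np.mean(b_terms)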

Soft-Margin SVMs: Non-Separable Training Data

So far we have assumed that the training data are linearly separable in $\phi(X)$. For many problems, some training data may overlap. The difficulty is that the error function $E_\infty$ goes to $\infty$ for any point on the wrong side of the decision surface. This is called a hard-margin SVM.

We relax this by adding a slack variable $S_m \ge 0$ for each training sample. We define $S_m = 0$ for samples on or inside the correct margin boundary, and $S_m = |y_m - g(X_m)|$ for all other samples. Thus:

For a sample inside the margin, but on the correct side of the decision surface: $0 < S_m \le 1$
For a sample on the decision surface: $S_m = 1$
For a sample on the wrong side of the decision surface: $S_m > 1$

(Soft-margin SVM: Bishop p. 332; note the use of $\xi_m$ in place of $S_m$.)

This is sometimes called a soft margin.
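
As a numeric aid (my own sketch, with assumed discriminant values), the slack variables can be computed as $S_m = \max(0,\, 1 - y_m\, g(X_m))$, which for labels in {-1, +1} agrees with $|y_m - g(X_m)|$ on the samples that violate the margin:

import numpy as np

def slack_variables(g_values, y):
    """S_m = 0 on the correct side of the margin, S_m = 1 - y_m g(X_m) otherwise:
    0 < S_m <= 1 inside the margin, S_m = 1 on the decision surface,
    S_m > 1 for misclassified samples."""
    return np.maximum(0.0, 1.0 - y * g_values)

g = np.array([1.7, 0.4, 0.0, -0.3])   # assumed discriminant values g(X_m)
y = np.array([+1, +1, +1, +1])
print(slack_variables(g, y))          # [0.0, 0.6, 1.0, 1.3]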

To softly penalize points on the wrong side, we minimize:

$C \sum_{m=1}^{M} S_m + \frac{1}{2} \|w\|^2$

where $C > 0$ controls the trade-off between the slack variables and the margin. Because any misclassified point has $S_m > 1$, an upper bound on the number of misclassified points is $\sum_{m=1}^{M} S_m$. C acts as an inverse regularization factor (note that $C = \infty$ recovers the earlier hard-margin SVM).

To solve for the SVM we write the Lagrangian:

$L(w, b, a) = \frac{1}{2} \|w\|^2 + C \sum_{m=1}^{M} S_m - \sum_{m=1}^{M} a_m \left\{ y_m\, g(X_m) - 1 + S_m \right\} - \sum_{m=1}^{M} \mu_m S_m$

The KKT conditions are:

$a_m \ge 0$
$y_m\, g(X_m) - 1 + S_m \ge 0$
$a_m \left\{ y_m\, g(X_m) - 1 + S_m \right\} = 0$
$\mu_m \ge 0$
$S_m \ge 0$
$\mu_m S_m = 0$

Setting the derivatives of $L(w, b, a)$ to zero gives:

$\frac{\partial L}{\partial w} = 0 \implies w = \sum_{m=1}^{M} a_m y_m \phi(X_m)$
$\frac{\partial L}{\partial b} = 0 \implies \sum_{m=1}^{M} a_m y_m = 0$
$\frac{\partial L}{\partial S_m} = 0 \implies a_m = C - \mu_m$

Using these to eliminate $w$, $b$ and $\{S_m\}$ from $L(w, b, a)$ we obtain:

$\tilde{L}(a) = \sum_{m=1}^{M} a_m - \frac{1}{2} \sum_{m=1}^{M} \sum_{n=1}^{M} a_m a_n y_m y_n\, k(X_m, X_n)$

This appears to be the same as before, except that the constraints are different:

$0 \le a_m \le C$
$\sum_{m=1}^{M} a_m y_m = 0$

(referred to as box constraints).

The solution is again a quadratic programming problem, with complexity $O(M^3)$. However, as before, a large subset of the training samples have $a_m = 0$ and thus do not contribute to the solution. For the remaining points:

$y_m\, g(X_m) = 1 - S_m$

For samples ON the margin: $a_m < C$, hence $\mu_m > 0$, requiring that $S_m = 0$.
For samples INSIDE the margin: $a_m = C$, with $S_m \le 1$ if correctly classified and $S_m > 1$ if misclassified.

As before, to solve for b we note that:

$y_m \left( \sum_{n \in S} a_n y_n\, k(X_m, X_n) + b \right) = 1$

Averaging over all support vectors in T gives:

$b = \frac{1}{N_T} \sum_{m \in T} \left( y_m - \sum_{n \in S} a_n y_n\, k(X_m, X_n) \right)$

where T denotes the set of support vectors such that $0 < a_m < C$.
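
To make the box-constrained dual concrete, here is a minimal sketch (my own, not from the notes) that solves it numerically with a general-purpose optimizer and then recovers b from the margin support vectors as above. A dedicated QP solver would normally be used instead; the data, function names and the choice of C are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

def train_soft_margin_svm(X, y, kernel, C, tol=1e-6):
    """Maximize L~(a) = sum_m a_m - 1/2 sum_m sum_n a_m a_n y_m y_n k(X_m, X_n)
    subject to 0 <= a_m <= C (box constraint) and sum_m a_m y_m = 0."""
    M = len(y)
    K = np.array([[kernel(X[i], X[j]) for j in range(M)] for i in range(M)])
    Q = (y[:, None] * y[None, :]) * K

    neg_dual = lambda a: 0.5 * a @ Q @ a - a.sum()        # minimize the negative dual
    result = minimize(neg_dual, np.zeros(M), method="SLSQP",
                      bounds=[(0.0, C)] * M,
                      constraints={"type": "eq", "fun": lambda a: a @ y})
    a = result.x

    S = [m for m in range(M) if a[m] > tol]               # support vectors (a_m > 0)
    T = [m for m in S if a[m] < C - tol]                  # margin support vectors (0 < a_m < C)
    b = np.mean([y[m] - sum(a[n] * y[n] * K[m, n] for n in S) for m in T])
    return a, b

# Usage with a Gaussian kernel and illustrative data
gaussian = lambda x, z, sigma=1.0: np.exp(-np.sum((x - z)**2) / (2 * sigma**2))
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(-2, 1, size=(15, 2)), rng.normal(+2, 1, size=(15, 2))])
y = np.array([-1.0] * 15 + [+1.0] * 15)
a, b = train_soft_margin_svm(X, y, gaussian, C=10.0)
g = lambda x: sum(a[m] * y[m] * gaussian(x, X[m]) for m in range(len(y))) + b
print(np.sign(g(np.array([0.0, 1.0]))))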