Intelligent Systems: Reasoning and Recognition
James L. Crowley
MoSIG 1                                       Winter Semester 2018
Lesson 6                                      27 February 2018

Outline: Perceptrons and Support Vector Machines

Notation
Linear Models
    Lines, Planes and Hyper-planes
Perceptrons
    History
    Theory
    Perceptron Margin
    Full perceptron algorithm (Optional)
Support Vector Machines
    Hard-margin SVMs - a simple linear classifier
    Finding the Support Vectors
    Soft margin SVM
SVM with Kernels
    Radial Basis Function Kernels
    Kernel Functions for Symbolic Data
Notation

x_d       A feature: an observed or measured value.
X         A vector of D features.
D         The number of dimensions of the vector X.
{x_m}     The training samples for learning.
{y_m}     The indicator variable for each training sample:
          y_m = +1 for examples of the target pattern (class 1),
          y_m = −1 for all other examples (class 2).
M         The number of training samples.
w, b      The coefficients of the linear model: the weights w = (w_1, w_2, ..., w_D)^T
          and the bias b, chosen so that w^T x + b ≥ 0 for P and w^T x + b < 0 for N.

Decision rule:              IF w^T x + b ≥ 0 THEN P ELSE N
Correct classification:     IF y_m (w^T x_m + b) ≥ 0 THEN True ELSE False
Margin for the classifier:  γ = min_m { y_m (w^T x_m + b) }
Linear Models

Lines, Planes and Hyper-planes (a quick review of basic geometry)

In a 2-D space, a line is the set of points that obeys the relation:

   w_1 x_1 + w_2 x_2 + b = 0

In vector notation this is written:

   (w_1  w_2) (x_1, x_2)^T + b = 0    that is    w^T x + b = 0

If ||w|| = 1 then |b| is the perpendicular distance of the line from the origin. For points that are not on the line, the linear equation gives the signed perpendicular distance from the line:

   d = w^T x + b = w_1 x_1 + w_2 x_2 + b

For points on one side of the line d > 0; on the other side d < 0; on the line d = 0. If w^T x + b = 0 is a boundary, then the sign of d can be used as a decision function. Lines, planes and hyper-planes can thus act as boundaries for regions of a feature space.

When b = 0 this is called a first-order homogeneous equation, because all terms are first order. The equation is easily generalized to any number of dimensions (features). In a 3-D space, a plane is represented by:

   w_1 x_1 + w_2 x_2 + w_3 x_3 + b = 0

In a D-dimensional space, the equation w^T x + b = 0 describes a hyper-plane, where

   w = (w_1, w_2, ..., w_D)^T

is the normal to the hyper-plane and b is a constant (bias) term. The signed distance

   d = w^T X + b = w_1 x_1 + w_2 x_2 + ... + w_D x_D + b

indicates which side of the boundary a point lies on.

Note that it is sometimes convenient to include the bias in the vectors:

   (w_1  w_2  ...  w_D  b) (x_1, x_2, ..., x_D, 1)^T = 0
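The signed-distance decision rule above can be sketched in a few lines of Python with NumPy; the line, weights and test points below are illustrative values, not taken from the lesson:

```python
import numpy as np

def signed_distance(w, b, x):
    """Signed perpendicular distance of x from the hyper-plane w^T x + b = 0.
    Assumes w has been normalized so that ||w|| = 1."""
    return float(np.asarray(w, dtype=float) @ np.asarray(x, dtype=float) + b)

def decide(w, b, x):
    """Decision function: P if on the positive side of the boundary, N otherwise."""
    return "P" if signed_distance(w, b, x) >= 0 else "N"

# A 2-D example: the line x_1 + x_2 - 1 = 0, rescaled so that ||w|| = 1.
w = np.array([1.0, 1.0]) / np.sqrt(2.0)
b = -1.0 / np.sqrt(2.0)

print(decide(w, b, [2.0, 2.0]))   # positive side of the line -> P
print(decide(w, b, [0.0, 0.0]))   # the origin, at signed distance b -> N
```

Because ||w|| = 1, the value returned by signed_distance is the true geometric distance to the line, as stated in the text.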
Perceptrons

History

During the 1950s, Frank Rosenblatt developed the idea of a trainable linear classifier for pattern recognition, called a Perceptron. The first Perceptron was a room-sized analog computer that implemented Rosenblatt's learning algorithm. Both the learning algorithm and the resulting recognition algorithm are easily implemented as computer programs.

In 1969, Marvin Minsky and Seymour Papert of MIT published a book entitled "Perceptrons" that claimed to document the fundamental limits of the perceptron approach. For example, they showed that a linear classifier cannot be constructed to perform an exclusive OR. While this is true for a one-layer perceptron, it is not true for multi-layer perceptrons.

Theory

The perceptron is an on-line learning method in which a linear classifier is improved by its own errors. A perceptron learns a linear decision boundary (hyper-plane) that separates the training samples. When the training data can be separated by a linear decision boundary, the data are said to be separable. The perceptron algorithm uses errors in classifying the training data to update the decision boundary until there are no more errors. The learning can be incremental: new training samples can be used to update the perceptron. If the training data are non-separable, the method may not converge and must be stopped after a certain number of iterations.

Assume a training set of M observations {x_m}, {y_m}, where

   x_m = (x_1, x_2, ..., x_D)^T   and   y_m ∈ {−1, +1}

The indicator variable y_m for each sample is y_m = +1 for examples of the target class (P) and y_m = −1 for all other classes (N).
The Perceptron will learn the coefficients

   w = (w_1, w_2, ..., w_D)^T   and   b

such that for all training data x_m: w^T x_m + b ≥ 0 for P and w^T x_m + b < 0 for N. Note that w^T X + b ≥ 0 is the same as w^T X ≥ −b; thus −b can be considered as a threshold on the product w^T X.

The decision function is:

   d(z) = P if z ≥ 0, N if z < 0

The algorithm requires a learning rate, α (taken as α = 1 below). Note that a training sample is correctly classified if:

   y_m (w^T x_m + b) ≥ 0

The perceptron algorithm uses misclassified training samples as corrections. The algorithm continues to loop through the training data until it makes an entire pass without a single misclassified training sample. If the training data are not separable, it will continue to loop forever; thus it is necessary to limit the number of iterations.

Update rule:

   FOR m = 1 TO M DO
      IF y_m (w^T x_m + b) < 0 THEN
         w ← w + y_m x_m
         b ← b + y_m R²
      END IF
   END FOR

where R is a scale factor for the data: R = max_m ||x_m||.

The final classifier is: IF w^T x + b ≥ 0 THEN P ELSE N.
Perceptron Margin

For a 2-class classifier, the margin, γ, is the smallest separation between the two classes. The margin for each sample m is:

   γ_m = y_m (w^T x_m + b)

The margin for a classifier and a set of training data is the minimum margin over the training data:

   γ = min_m { y_m (w^T x_m + b) }

The quality of a perceptron is given by the histogram of the margins, h(γ), for the training data.

Full perceptron algorithm (Optional)

For reference, here is the full perceptron algorithm.

Algorithm:
   w^(0) ← 0;  b^(0) ← 0;  i ← 0;  update ← TRUE;  R ← max_m ||x_m||
   WHILE update DO
      update ← FALSE
      FOR m = 1 TO M DO
         IF y_m (w^(i)T x_m + b^(i)) < 0 THEN
            update ← TRUE
            w^(i+1) ← w^(i) + y_m x_m
            b^(i+1) ← b^(i) + y_m R²
            i ← i + 1
         END IF
      END FOR
   END WHILE
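The full algorithm above can be implemented directly; the sketch below follows the pseudo-code with α = 1, a bias update scaled by the data radius R, and an epoch limit for non-separable data. The toy data set is illustrative, not from the lesson.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron learning. X is an M x D array of samples, y is in {-1, +1}.
    Returns (w, b). Stops after max_epochs passes if the data are not separable."""
    M, D = X.shape
    w = np.zeros(D)
    b = 0.0
    R = np.max(np.linalg.norm(X, axis=1))     # scale factor R = max_m ||x_m||
    for _ in range(max_epochs):
        update = False
        for m in range(M):
            if y[m] * (w @ X[m] + b) <= 0:    # misclassified (or on the boundary)
                w = w + y[m] * X[m]
                b = b + y[m] * R**2
                update = True
        if not update:                        # a full pass with no errors: done
            break
    return w, b

# Linearly separable toy data: class +1 lies above the line x_1 + x_2 = 1.
X = np.array([[2.0, 2.0], [1.5, 1.0], [0.0, 0.0], [0.2, 0.3]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(all(y[m] * (w @ X[m] + b) > 0 for m in range(len(y))))  # all correctly classified
```

Because the data are separable, the loop terminates with every training sample on the correct side of the learned boundary.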
Support Vector Machines

Support Vector Machines (SVMs), also known as maximum margin classifiers, are popular for problems of classification, regression and novelty detection. The solution for the model parameters corresponds to a convex optimization problem. SVMs use a minimal subset of the training data (the support vectors) to define the "best" decision surface between two classes. We will use the two-class problem, K = 2, to illustrate the principle; multi-class solutions are possible.

The simplest case, the hard-margin SVM, requires that the training data be completely separated by at least one hyper-plane. This is generally achieved by using a kernel to map the features into a high-dimensional space where the two classes are separable. To illustrate the principle, we will first examine a simple linear SVM where the data are separable. We will then generalize with kernels and with soft margins.

We will assume that the training data is a set of M training samples {x_m} with an indicator variable {y_m}, where y_m is −1 or +1.

Hard-margin SVMs - a simple linear classifier

The simplest case is a linear classifier trained from separable data:

   g(x) = w^T x + b

with the decision rule: IF g(x) ≥ 0 THEN P ELSE N.

For a hard-margin SVM we assume that the two classes are separable for all of the training data:

   ∀m: y_m (w^T x_m + b) ≥ 0

We will use a subset S = {x_s} ⊆ {x_m}, composed of s training samples, to define the best decision surface g(x) = w^T x + b. The s selected training samples {x_s} are called the support vectors. The minimum number of support vectors depends on the number of features, D: s = D + 1. For example, in a 2-D feature space we need only 3 training samples to serve as support vectors.
To define the classifier we will look for the subset of s training samples that maximizes the margin, γ, between the two classes.

   Margin: γ = min_m { y_m (w^T x_m + b) }

Thus we seek a subset of s = D + 1 training samples {x_s} ⊆ {x_m} to define the decision surface and the margin; we will use these samples as support vectors. The algorithm must choose the D + 1 training samples {x_s} from the training samples {x_m} such that the margin is as large as possible. This is equivalent to a search for a pair of parallel surfaces lying at a distance γ on either side of the decision surface.

Finding the Support Vectors

For any hyper-plane with normalized coefficients, ||w|| = 1, the distance of a sample point x_m from the hyper-plane is:

   d_m = y_m (w^T x_m + b)

The margin is the minimum distance: γ = min_m { y_m (w^T x_m + b) }.

For the D + 1 support vectors x_m ∈ {x_s}: d_m = γ. For all other training samples x_m ∉ {x_s}: d_m ≥ γ.

The scale of the margin is determined by ||w||. To find the support vectors, we can arbitrarily define the margin as γ = 1 and then renormalize to ||w|| = 1 once the support vectors have been discovered. With γ = 1, we will look for two separate hyper-planes that bound the decision surface, such that for points on these surfaces:
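For a given hyper-plane, the margin and the samples that attain it (the candidate support vectors) can be computed directly from the definition above; the weights and data here are made-up values for illustration:

```python
import numpy as np

def margin_and_active(w, b, X, y, tol=1e-9):
    """Margin gamma = min_m y_m (w^T x_m + b) for ||w|| = 1, together with
    the indices of the samples that attain it (candidate support vectors)."""
    w = np.asarray(w, dtype=float)
    n = np.linalg.norm(w)
    w, b = w / n, b / n                      # rescale so that ||w|| = 1
    d = y * (X @ w + b)                      # signed distances d_m
    gamma = d.min()
    active = np.where(d <= gamma + tol)[0]   # samples lying on the margin
    return gamma, active

# Illustrative boundary x_1 + x_2 - 1 = 0 and toy data.
w0, b0 = np.array([1.0, 1.0]), -1.0
Xs = np.array([[2.0, 2.0], [1.5, 1.0], [0.0, 0.0], [0.2, 0.3]])
ys = np.array([1.0, 1.0, -1.0, -1.0])
gamma, active = margin_and_active(w0, b0, Xs, ys)
print(gamma, active)  # the closest sample determines the margin
```

Note that both w and b are divided by ||w||: rescaling only w would change which hyper-plane the distances refer to.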
   w^T x + b = 1   and   w^T x + b = −1

The distance between these two planes is 2 / ||w||.

We will add the constraint that all training samples that are support vectors satisfy y_m (w^T x_m + b) = 1, while all other samples satisfy y_m (w^T x_m + b) > 1. This can be written as:

   y_m (w^T x_m + b) ≥ 1

This is known as the canonical representation for the decision hyper-plane. This gives an optimization problem that minimizes ||w|| subject to y_m (w^T x_m + b) ≥ 1. The training samples where y_m (w^T x_m + b) = 1 are said to be active; all other training samples are inactive.

Thus the optimization problem is:

   arg min_{w,b} (1/2) ||w||²   subject to   y_m (w^T x_m + b) ≥ 1 for all m

while minimizing the number of active points, s. (The factor of 1/2 is a convenience for analysis.) This can be solved as a quadratic optimization problem using Lagrange multipliers, with one multiplier, a_m ≥ 0, for each constraint. The Lagrangian function is:

   L(w, b, a) = (1/2) ||w||² − Σ_{m=1}^{M} a_m { y_m (w^T x_m + b) − 1 }

Setting the derivatives to zero, we obtain:

   ∂L/∂w = 0  ⇒  w = Σ_{m=1}^{M} a_m y_m x_m
   ∂L/∂b = 0  ⇒  Σ_{m=1}^{M} a_m y_m = 0

The bias b is then recovered from the support vectors, for example by averaging:

   b = (1/s) Σ_{x_m ∈ S} ( y_m − w^T x_m )
Soft margin SVM

We can extend the SVM to the case where the training data are not separable by introducing the hinge loss function:

   hinge loss: L_m = max{ 0, 1 − y_m (w^T x_m + b) }

This loss is 0 for samples that are correctly classified with a margin of at least 1, and positive otherwise. Using the hinge loss, we can minimize:

   (1/M) Σ_{m=1}^{M} max{ 0, 1 − y_m (w^T x_m + b) } + λ ||w||²

where λ controls the trade-off between the size of the margin and the classification errors. To set this up as an optimization problem, we note that L_m = max{0, 1 − y_m (w^T x_m + b)} is the smallest non-negative value that satisfies y_m (w^T x_m + b) ≥ 1 − L_m. We rewrite the optimization as:

   minimize   (1/M) Σ_{m=1}^{M} L_m + λ ||w||²
   subject to y_m (w^T x_m + b) ≥ 1 − L_m  and  L_m ≥ 0 for all m

This was classically solved as a quadratic optimization problem with Lagrange multipliers:

   maximize   Σ_{m=1}^{M} a_m − (1/2) Σ_{i=1}^{M} Σ_{j=1}^{M} y_i a_i (x_i^T x_j) y_j a_j
   subject to Σ_{m=1}^{M} a_m y_m = 0  and  0 ≤ a_m ≤ 1/(2Mλ) for all m

The resulting decision surface is:

   w = Σ_{m=1}^{M} a_m y_m x_m

and the bias is found by solving b = w^T x_m − y_m for any support vector x_m ∈ {x_s}.

Modern approaches use gradient descent to solve for w and b, offering important advantages for large data sets.
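A minimal sketch of the gradient-descent approach mentioned above: sub-gradient descent on the regularized hinge loss (1/M) Σ max{0, 1 − y_m(w^T x_m + b)} + λ||w||². The step size, λ, epoch count and toy data are illustrative assumptions, not values from the lesson:

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, eta=0.1, epochs=500):
    """Soft-margin linear SVM trained by sub-gradient descent on the hinge loss."""
    M, D = X.shape
    w = np.zeros(D)
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                   # samples with non-zero hinge loss
        # Sub-gradient of (1/M) sum max(0, 1 - y(w.x + b)) + lam ||w||^2
        grad_w = 2.0 * lam * w - (y[viol, None] * X[viol]).sum(axis=0) / M
        grad_b = -y[viol].sum() / M
        w -= eta * grad_w
        b -= eta * grad_b
    return w, b

# Linearly separable toy data (the method also tolerates non-separable data).
Xt = np.array([[2.0, 2.0], [1.5, 1.0], [0.0, 0.0], [0.2, 0.3]])
yt = np.array([1.0, 1.0, -1.0, -1.0])
w, b = svm_sgd(Xt, yt)
print(np.sign(Xt @ w + b))  # signs of g(x) for the training samples
```

Unlike the quadratic-programming dual, this formulation never builds the M x M matrix of inner products, which is the advantage for large data sets noted in the text.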
SVM with Kernels

Support Vector Machines can be generalized for use with non-linear decision surfaces using kernel functions. This provides a very general and efficient classifier. A kernel function transforms the data into a space where the two classes are separable. Instead of the decision surface

   g(x) = w^T x + b

we use a decision surface

   g(x) = w^T f(x) + b

where f(·) transforms the features, and

   w = Σ_{m=1}^{M} a_m y_m f(x_m)

is learned from the transformed training data. The coefficients w and the bias b are provided by the D + 1 support vectors. As before, S = {x_s} are the support vectors, and a_m is a variable learned from the training data that is a_m > 0 for the support vectors and 0 for all others.
Radial Basis Function Kernels

Radial basis functions are a popular kernel function. A radial basis function (RBF) is a real-valued function whose value depends only on the distance from the origin. For example, the Gaussian function

   f(z) = e^{−z² / 2σ²}

is a popular radial basis function, and is often used as a kernel for support vector machines, with

   z = ||X − x_s||

for each support vector. We can use a sum of s radial basis functions to define a discriminant function, where the support vectors are drawn from the training samples. This gives the discriminant function:

   g(X) = Σ_{m=1}^{M} a_m y_m f(||X − x_m||)

The training samples x_m for which a_m ≠ 0 are the support vectors.

Depending on σ, this can provide a good fit or an over-fit to the data. If σ is large compared to the distance between the classes, this can give an overly flat discriminant surface. If σ is small compared to the distance between the classes, this will over-fit the samples. A good choice for σ will be comparable to the distance between the closest members of the two classes.

(images from "A Computational Biology Example using Support Vector Machines", Suzy Fei, 2009)

Each radial basis function is a dimension in a high-dimensional basis space.
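The RBF discriminant above can be written directly as a sum over support vectors; the support vectors, weights a_m, labels and σ below are made-up values for illustration:

```python
import numpy as np

def rbf(z, sigma):
    """Gaussian radial basis function f(z) = exp(-z^2 / (2 sigma^2))."""
    return np.exp(-z**2 / (2.0 * sigma**2))

def rbf_discriminant(X_new, support, alphas, labels, sigma):
    """g(X) = sum_m a_m y_m f(||X - x_m||), summed over the support vectors."""
    g = 0.0
    for x_m, a_m, y_m in zip(support, alphas, labels):
        g += a_m * y_m * rbf(np.linalg.norm(X_new - x_m), sigma)
    return g

# Illustrative support vectors: one from each class, with sigma comparable
# to the distance between the two classes, as recommended in the text.
support = np.array([[0.0, 0.0], [2.0, 2.0]])
alphas = np.array([1.0, 1.0])
labels = np.array([-1.0, 1.0])
sigma = 1.0

print(rbf_discriminant(np.array([1.9, 1.9]), support, alphas, labels, sigma) > 0)
```

A query near the +1 support vector gets a positive score and one near the −1 support vector a negative score, since the nearer basis function dominates the sum.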
Kernel Functions for Symbolic Data

Kernel functions can also be defined over graphs, sets, strings and text. Consider, for example, a non-vector space composed of a set of words {W}. We can select a subset of discriminant words {S} ⊆ {W}. Now, given a set of words (a probe) {A} ⊆ {W}, we can define a kernel function of A and S using the intersection operation:

   k(A, S) = 2^{|A ∩ S|}

where |·| denotes the cardinality (the number of elements) of a set.
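The set-intersection kernel above is straightforward to sketch in Python; the word sets used here are illustrative examples:

```python
def set_kernel(A, S):
    """k(A, S) = 2^|A ∩ S|: a kernel on sets of words, defined via the
    cardinality of the set intersection."""
    return 2 ** len(set(A) & set(S))

# Illustrative discriminant words and a probe set of words.
S = {"gene", "protein", "cell"}
probe = {"the", "cell", "protein"}

print(set_kernel(probe, S))  # |A ∩ S| = 2, so k(A, S) = 2^2 = 4
```

Note that a probe sharing no words with S still gives k = 2^0 = 1, not 0, so the kernel is strictly positive on all pairs of sets.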