Intelligent Systems: Reasoning and Recognition. Perceptrons and Support Vector Machines

James L. Crowley
MoSIG M1, Winter Semester 2018
Lesson 6, 27 February 2018

Perceptrons and Support Vector Machines

Outline:
Notation
Linear Models
   Lines, Planes and Hyper-planes
Perceptrons
   History
   Theory
   Perceptron Margin
   Full Perceptron Algorithm (Optional)
Support Vector Machines
   Hard-Margin SVMs - a Simple Linear Classifier
   Finding the Support Vectors
   Soft-Margin SVM
SVM with Kernels
   Radial Basis Function Kernels
   Kernel Functions for Symbolic Data

Notation

x_d      A feature. An observed or measured value.
x        A vector of D features.
D        The number of dimensions of the vector x.
{x_m}    The training samples used for learning.
{y_m}    The indicator variable for each training sample:
         y_m = +1 for examples of the target pattern (class 1)
         y_m = -1 for all other examples (class 2)
M        The number of training samples.
w        The coefficients (weights) of the linear model, w = (w_1, w_2, ..., w_D)^T.
b        The bias, chosen so that w^T x_m + b ≥ 0 for P and w^T x_m + b < 0 for N.

Decision rule:  IF w^T x + b ≥ 0 THEN P ELSE N
A training sample is correctly classified IF y_m (w^T x_m + b) ≥ 0 (True), ELSE it is misclassified (False).
Margin for the classifier:  γ = min_m { y_m (w^T x_m + b) }

Linear Models

Lines, Planes and Hyper-planes (a quick review of basic geometry)

In a 2-D space, a line is the set of points that obeys the relation:

   w_1 x_1 + w_2 x_2 + b = 0

In vector notation:

   (w_1  w_2) (x_1, x_2)^T + b = 0,   which is written:   w^T x + b = 0

If ||w|| = 1 then b is the perpendicular distance from the line to the origin.

For points that are not on the line, the linear equation gives the signed perpendicular distance to the line:

   d = w_1 x_1 + w_2 x_2 + b

For points on one side of the line, d > 0. On the other side, d < 0. On the line, d = 0. If w^T x + b = 0 is a boundary, then the sign of d can be used as a decision function. Lines, planes and hyper-planes can act as boundaries for regions of a feature space.

This is called a first-order equation because all terms are first order. It is easily generalized to any number of dimensions (features). In a 3-D space, a plane is represented by

   w_1 x_1 + w_2 x_2 + w_3 x_3 + b = 0

In a D-dimensional space, such a linear equation defines a hyper-plane. The equation w^T x + b = 0 is a hyper-plane in a D-dimensional space, where w = (w_1, w_2, ..., w_D)^T is the normal to the hyper-plane and b is a constant term.

The signed distance

   d = w^T x + b = w_1 x_1 + w_2 x_2 + ... + w_D x_D + b

indicates on which side of the boundary a point lies.

Note that it is sometimes convenient to include the bias in the vector, using homogeneous coordinates:

   (w_1  w_2  b) (x_1, x_2, 1)^T = 0
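As a small illustration, the following Python fragment (with invented weights) computes the signed distance d = w^T x + b and uses its sign as a decision function:

import numpy as np

w = np.array([0.6, 0.8])        # normal to the line (here ||w|| = 1, so d is a true distance)
b = -2.0                        # bias: perpendicular distance from the line to the origin

def signed_distance(x):
    """d = w^T x + b: positive on one side of the boundary, negative on the other, zero on it."""
    return w @ x + b

x = np.array([3.0, 1.0])
print(signed_distance(x))                          # 0.6*3 + 0.8*1 - 2 = 0.6
print('P' if signed_distance(x) >= 0 else 'N')     # decision function based on the sign of d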

Perceptrons

History

During the 1950s, Frank Rosenblatt developed the idea of a trainable linear classifier for pattern recognition, called a Perceptron. The first Perceptron was a room-sized analog computer that implemented Rosenblatt's learning algorithm. Both the learning algorithm and the resulting recognition algorithm are easily implemented as computer programs.

In 1969, Marvin Minsky and Seymour Papert of MIT published a book entitled "Perceptrons" that claimed to document the fundamental limits of the perceptron approach. For example, they showed that a linear classifier could not be constructed to perform an exclusive OR. While this is true for a one-layer perceptron, it is not true for multi-layer perceptrons.

Theory

The perceptron is an on-line learning method in which a linear classifier is improved by its own errors. A perceptron learns a linear decision boundary (hyper-plane) that separates the training samples. When the training data can be separated by a linear decision boundary, the data are said to be separable. The perceptron algorithm uses errors in classifying the training data to update the decision boundary until there are no more errors. The learning can be incremental: new training samples can be used to update the perceptron. If the training data are non-separable, the method may not converge and must be stopped after a certain number of iterations.

Assume a training set of M observations {x_m}, {y_m}, where x_m = (x_1, x_2, ..., x_D)^T and y_m ∈ {-1, +1}.

The indicator variable, y_m, for each sample is
   y_m = +1 for examples of the target class (P)
   y_m = -1 for all other classes (N)

The Perceptron will learn the coefficients of a linear boundary, w = (w_1, w_2, ..., w_D)^T and b, such that for all training data x_m:

   w^T x_m + b ≥ 0 for P   and   w^T x_m + b < 0 for N

Note that w^T x + b ≥ 0 is the same as w^T x ≥ -b. Thus b can be considered as a threshold on the product w^T x.

The decision function is:

   d(z) = P if z ≥ 0,   N if z < 0

The algorithm requires a learning rate, α.

Note that a training sample is correctly classified if:

   y_m (w^T x_m + b) ≥ 0

The perceptron algorithm uses misclassified training data as a correction. The algorithm will continue to loop through the training data until it makes an entire pass without a single misclassified training sample. If the training data are not separable then it will continue to loop forever.

Update rule:

   FOR m = 1 TO M DO
      IF y_m (w^T x_m + b) < 0 THEN
         w ← w + α y_m x_m
         b ← b + α y_m R²
      END IF
   END FOR

where R is a scale factor for the data: R = max_m ||x_m||.

If the data are not separable, the Perceptron will not converge and will continue to loop. Thus it is necessary to limit the number of iterations.

The final classifier is: IF w^T x + b ≥ 0 THEN P ELSE N.

Perceptron Margin

For a 2-class classifier, the margin, γ, is the smallest separation between the two classes.

The margin for each sample, m, is:

   γ_m = y_m (w^T x_m + b)

The margin for a classifier and a set of training data is the minimum margin over the training data:

   γ = min_m { y_m (w^T x_m + b) }

The quality of a perceptron is given by the histogram of the margins, h(γ), for the training data.

Full Perceptron Algorithm (Optional)

For reference, here is the full perceptron algorithm.

Algorithm:
   w^(0) ← 0;  b^(0) ← 0;  i ← 0
   R ← max_m ||x_m||
   WHILE update DO
      update ← FALSE
      FOR m = 1 TO M DO
         IF y_m (w^(i)T x_m + b^(i)) < 0 THEN
            update ← TRUE
            w^(i+1) ← w^(i) + α y_m x_m
            b^(i+1) ← b^(i) + α y_m R²
            i ← i + 1
         END IF
      END FOR
   END WHILE
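As a concrete illustration, here is a minimal sketch of the algorithm above in Python/NumPy. The function name, the learning-rate default and the maximum number of epochs are illustrative choices, not part of the notes; the misclassification test uses ≤ 0 so that the all-zero initial weights are also updated.

import numpy as np

def perceptron_train(X, y, alpha=1.0, max_epochs=100):
    """Train a perceptron. X is an (M, D) array of samples, y an (M,) array of +/-1 labels."""
    M, D = X.shape
    w = np.zeros(D)                              # w^(0) <- 0
    b = 0.0                                      # b^(0) <- 0
    R = np.max(np.linalg.norm(X, axis=1))        # scale factor R = max_m ||x_m||
    for _ in range(max_epochs):                  # stop after a fixed number of passes if non-separable
        update = False
        for m in range(M):
            if y[m] * (w @ X[m] + b) <= 0:       # misclassified sample
                w = w + alpha * y[m] * X[m]
                b = b + alpha * y[m] * R**2
                update = True
        if not update:                           # a full pass with no errors: converged
            break
    return w, b

def classify(w, b, x):
    """Decision rule: IF w^T x + b >= 0 THEN P ELSE N."""
    return 'P' if w @ x + b >= 0 else 'N'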

Support Vector Machines

Support Vector Machines (SVMs), also known as maximum margin classifiers, are popular for problems of classification, regression and novelty detection. The solution for the model parameters corresponds to a convex optimization problem. SVMs use a minimal subset of the training data (the support vectors) to define the best decision surface between two classes.

We will use the two-class problem, K = 2, to illustrate the principle. Multi-class solutions are possible.

The simplest case, the hard-margin SVM, requires that the training data be completely separated by at least one hyper-plane. This is generally achieved by using a kernel to map the features into a high-dimensional space where the two classes are separable. To illustrate the principle, we will first examine a simple linear SVM where the data are separable. We will then generalize with kernels and with soft margins.

We will assume that the training data is a set of M training samples {x_m} with an indicator variable {y_m}, where y_m is -1 or +1.

Hard-Margin SVMs - a Simple Linear Classifier

The simplest case is a linear classifier trained from separable data:

   g(x) = w^T x + b

with the decision rule: IF g(x) ≥ 0 THEN P ELSE N

For a hard-margin SVM we assume that the two classes are separable for all of the training data:

   ∀m:  y_m (w^T x_m + b) ≥ 0

We will use a subset S of the training samples, {x_s} ⊂ {x_m}, composed of M_s training samples, to define the best decision surface g(x) = w^T x + b.

The minimum number of support vectors depends on the number of features, D:

   M_s = D + 1

The M_s selected training samples {x_s} are called the support vectors. For example, in a 2-D feature space we need only 3 training samples to serve as support vectors.

To define the classifier we will look for the subset S of training samples that maximizes the margin (γ) between the two classes.

Margin:  γ = min_m { y_m (w^T x_m + b) }

Thus we seek a subset of M_s = D + 1 training samples, {x_s} ⊂ {x_m}, to define the decision surface and the margin. We will use these samples as support vectors. Our algorithm must choose the D + 1 training samples {x_s} from the M training samples in {x_m} such that the margin is as large as possible. This is equivalent to a search for a pair of parallel surfaces at a distance γ from the decision surface.

Finding the Support Vectors

For any hyperplane with normalized coefficients (||w|| = 1), the distance of any sample point x_m from the hyper-plane is:

   d_m = y_m (w^T x_m + b)

The margin is the minimum distance:  γ = min_m { y_m (w^T x_m + b) }

For the D + 1 support vectors x_m ∈ {x_s}:  d_m = γ
For all other training samples x_m ∉ {x_s}:  d_m ≥ γ

The scale of the margin is determined by ||w||. To find the support vectors, we can arbitrarily define the margin as γ = 1 and then renormalize to ||w|| = 1 once the support vectors have been discovered. With γ = 1, we will look for two separate hyper-planes that bound the decision surface, such that for points on these surfaces:

   w^T x + b = 1   and   w^T x + b = -1

The distance between these two planes is 2/||w||.

We add the constraint that all positive training samples (y_m = +1) satisfy w^T x_m + b ≥ 1, while all negative samples (y_m = -1) satisfy w^T x_m + b ≤ -1. This can be written as:

   y_m (w^T x_m + b) ≥ 1

This is known as the canonical representation for the decision hyperplane. This gives an optimization problem that minimizes ||w|| subject to y_m (w^T x_m + b) ≥ 1.

For points on the margin: y_m (w^T x_m + b) = 1. For all other points: y_m (w^T x_m + b) > 1. The training samples where y_m (w^T x_m + b) = 1 are said to be active. All other training samples are inactive.

Thus the optimization problem is to maximize the margin by computing

   arg min_{w,b} { (1/2) ||w||² }

subject to the constraints above, while minimizing the number of active points, M_s. (The factor of 1/2 is a convenience for analysis.)

This can be solved as a quadratic optimization problem using Lagrange multipliers, with one multiplier, a_m ≥ 0, for each constraint. The Lagrangian function is:

   L(w, b, a) = (1/2) ||w||² - Σ_{m=1}^{M} a_m { y_m (w^T x_m + b) - 1 }

Setting the derivatives to zero, we obtain:

   ∂L/∂w = 0   ⟹   w = Σ_{m=1}^{M} a_m y_m x_m

   ∂L/∂b = 0   ⟹   Σ_{m=1}^{M} a_m y_m = 0

The bias can then be recovered from the support vectors, for example by averaging over S:

   b = (1/M_s) Σ_{x_s ∈ S} ( y_s - w^T x_s )
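In practice this quadratic program is rarely solved by hand; an off-the-shelf solver recovers w, b and the support vectors directly. The sketch below uses scikit-learn's SVC with a linear kernel and a large C to approximate the hard-margin case on an invented, separable 2-D data set; the data and parameter values are illustrative only.

import numpy as np
from sklearn.svm import SVC

# Toy separable 2-D data: class P (y = +1) around (2, 2), class N (y = -1) around (-2, -2)
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=2.0, scale=0.5, size=(20, 2))
X_neg = rng.normal(loc=-2.0, scale=0.5, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 20 + [-1] * 20)

# A very large C approximates the hard-margin SVM (no margin violations tolerated)
clf = SVC(kernel='linear', C=1e6)
clf.fit(X, y)

w = clf.coef_[0]          # normal to the separating hyper-plane
b = clf.intercept_[0]     # bias term
print("w =", w, "b =", b)
print("support vectors:", clf.support_vectors_)     # the active samples, y_m (w^T x_m + b) = 1
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))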

Soft-Margin SVM

We can extend the SVM to the case where the training data are not separable by introducing the hinge loss:

   L_m = max{ 0, 1 - y_m (w^T x_m + b) }

This loss is 0 for samples that are correctly classified and lie outside the margin, and grows linearly with the violation otherwise. Using the hinge loss, we can minimize

   (1/M) Σ_{m=1}^{M} max{ 0, 1 - y_m (w^T x_m + b) }  +  λ ||w||²

To set this up as an optimization problem, we note that L_m = max{0, 1 - y_m (w^T x_m + b)} is the smallest non-negative value that satisfies y_m (w^T x_m + b) ≥ 1 - L_m. We therefore rewrite the optimization as:

   minimize    (1/M) Σ_{m=1}^{M} L_m + λ ||w||²
   subject to  y_m (w^T x_m + b) ≥ 1 - L_m  and  L_m ≥ 0  for all m

This was classically solved as a quadratic optimization problem with Lagrange multipliers:

   maximize    Σ_{m=1}^{M} a_m  -  (1/2) Σ_{i=1}^{M} Σ_{j=1}^{M} y_i a_i (x_i^T x_j) y_j a_j
   subject to  Σ_{m=1}^{M} a_m y_m = 0  and  0 ≤ a_m ≤ 1/(2Mλ)  for all m

The resulting decision surface is:

   w = Σ_{m=1}^{M} a_m y_m x_m

and the bias is found by solving b = y_s - w^T x_s for any support vector x_s ∈ {x_s}.

Modern approaches use gradient descent to solve for w and b, offering important advantages for large data sets.
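As a sketch of the gradient-descent approach mentioned above, the following Python/NumPy fragment minimizes the regularized hinge-loss objective by stochastic subgradient descent. The function name, step size, λ value and number of epochs are illustrative assumptions, not part of the notes.

import numpy as np

def svm_sgd(X, y, lam=0.01, eta=0.01, epochs=200):
    """Minimize (1/M) sum_m max(0, 1 - y_m (w^T x_m + b)) + lam * ||w||^2 by stochastic subgradient descent."""
    M, D = X.shape
    w = np.zeros(D)
    b = 0.0
    for _ in range(epochs):
        for m in np.random.permutation(M):
            if y[m] * (X[m] @ w + b) < 1:          # hinge loss active: its subgradient is -y_m x_m
                grad_w = 2 * lam * w - y[m] * X[m]
                grad_b = -y[m]
            else:                                  # only the regularizer contributes
                grad_w = 2 * lam * w
                grad_b = 0.0
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b

# Decision rule as before: IF w^T x + b >= 0 THEN P ELSE N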

SVM with Kernels

Support Vector Machines can be generalized for use with non-linear decision surfaces using kernel functions. This provides a very general and efficient classifier.

A kernel function, f(z), transforms the data into a space where the two classes are separable. Instead of the decision surface

   g(x) = w^T x + b

we use the decision surface

   g(x) = w^T f(x) + b

where the coefficients w and the bias b are provided by the D + 1 support vectors. The weight vector

   w = Σ_{m=1}^{M} a_m y_m f(x_m)

is learned from the transformed training data. As before, S = {x_s} are the support vectors, and a_m is a coefficient learned from the training data, with a_m ≠ 0 for support vectors and a_m = 0 for all others.

Radial Basis Function Kernels

Radial basis functions are a popular family of kernel functions. A radial basis function (RBF) is a real-valued function whose value depends only on the distance from the origin. For example, the Gaussian function

   f(z) = e^( -z² / (2σ²) )

is a popular radial basis function, and is often used as a kernel for support vector machines, with

   z = ||X - x_s||

for each support vector x_s. We can use a sum of M_s radial basis functions to define a discriminant function, where the support vectors are drawn from the training samples. This gives the discriminant function

   g(X) = Σ_{m=1}^{M} a_m y_m f(||X - x_m||)

The training samples x_m for which a_m ≠ 0 are the support vectors.

Depending on σ, this can provide a good fit or an overfit to the data. If σ is large compared to the distance between the classes, this can give an overly flat discriminant surface. If σ is small compared to the distance between the classes, this will overfit the samples. A good choice for σ is comparable to the distance between the closest members of the two classes.

(Images from "A Computational Biology Example using Support Vector Machines", Suzy Fei, 2009.)

Each radial basis function is a dimension in a high-dimensional basis space.
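A minimal numerical sketch of this RBF discriminant, assuming the coefficients a_m, labels y_m and support vectors have already been obtained from training (the values below are invented for illustration):

import numpy as np

def rbf(z, sigma):
    """Gaussian radial basis function f(z) = exp(-z^2 / (2 sigma^2))."""
    return np.exp(-z**2 / (2.0 * sigma**2))

def discriminant(X, support_vectors, a, y, sigma):
    """g(X) = sum_m a_m y_m f(||X - x_m||), summed over the support vectors."""
    z = np.linalg.norm(support_vectors - X, axis=1)     # distances ||X - x_m||
    return np.sum(a * y * rbf(z, sigma))

# Illustrative (made-up) support vectors, coefficients and labels
support_vectors = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])
a = np.array([0.7, 0.5, 0.9])      # a_m != 0 only for support vectors
y = np.array([+1, -1, +1])
sigma = 1.0                        # comparable to the distance between the closest members of the two classes

X = np.array([0.5, 0.2])
print('P' if discriminant(X, support_vectors, a, y, sigma) >= 0 else 'N')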

Kernel Functions for Symbolic Data

Kernel functions can also be defined over graphs, sets, strings and text.

Consider, for example, a non-vector space composed of a set of words {W}. We can select a subset of discriminant words {S} ⊂ {W}. Now, given a set of words (a probe) {A} ⊂ {W}, we can define a kernel function of A and S using the intersection operation:

   k(A, S) = 2^|A ∩ S|

where |·| denotes the cardinality (the number of elements) of a set.
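A small sketch of this set kernel in Python (the word sets below are invented for illustration):

def set_kernel(A, S):
    """k(A, S) = 2^|A intersect S|, where |.| is the cardinality of a set."""
    return 2 ** len(A & S)

S = {"margin", "kernel", "hyperplane"}       # discriminant words
A = {"the", "kernel", "maps", "margin"}      # a probe
print(set_kernel(A, S))                      # |A intersect S| = 2, so k = 4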