Linear Models for Classification
Henrik I Christensen
Robotics & Intelligent Machines @ GT
Georgia Institute of Technology, Atlanta, GA 30332-0280
hic@cc.gatech.edu
Outline
1. Introduction
2. Linear Discriminant Functions
3. LSQ for Classification
4. Fisher's Discriminant Method
5. Perceptrons
6. Summary
Introduction
- Last time: prediction of new function values
- Today: linear classification of data
- Basic pattern recognition
- Separation of data: buy/sell
- Segmentation of line data, ...
Simple Example - Bolts or Needles
Classification
Given:
- An input vector: $x \in X$
- A set of classes: $c_i \in C$, $i = 1, \dots, k$
- A mapping $m: X \to C$
The mapping separates the space into decision regions; the boundaries are termed decision boundaries/surfaces.
Basic Formulation
- It is a 1-of-K coding problem
- Target vector: $t = (0, \dots, 1, \dots, 0)^T$
- Consideration of 3 different approaches:
  1. Optimization of a discriminant function
  2. Bayesian formulation: $p(c_i \mid x)$
  3. Learning & decision fusion
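To make the coding concrete, here is a minimal sketch of 1-of-K target encoding (not from the slides; the helper name `one_of_k` is invented), assuming NumPy:

```python
import numpy as np

def one_of_k(labels, K):
    """Encode integer class labels 0..K-1 as 1-of-K target vectors."""
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels] = 1.0
    return T

print(one_of_k([0, 2, 1], K=3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```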
Code for experimentation
There are data sets and sample code available:
- NETLAB: http://www.ncrg.aston.ac.uk/netlab/index.php
- NAVTOOLBOX: http://www.cas.kth.se/toolbox/
- SLAM Dataset: http://kaspar.informatik.uni-freiburg.de/~slamevaluation/datasets.php
Discriminant Functions
- Objective: assign an input vector $x$ to a class $c_i$
- Simple formulation: $y(x) = w^T x + w_0$
- $w$ is termed a weight vector
- $w_0$ is termed a bias
- Two-class example: decide $c_1$ if $y(x) \geq 0$, otherwise $c_2$
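A minimal sketch of the two-class rule (illustrative weight values, not from the slides):

```python
import numpy as np

def classify(x, w, w0):
    """Decide c1 if y(x) = w^T x + w0 >= 0, otherwise c2."""
    return "c1" if w @ x + w0 >= 0 else "c2"

w = np.array([1.0, -1.0])   # hypothetical weight vector
w0 = 0.5                    # hypothetical bias
print(classify(np.array([2.0, 1.0]), w, w0))  # y = 1.5, so c1
```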
Basic Design
- Take two points on the decision surface, $x_a$ and $x_b$: $y(x_a) = y(x_b) = 0$, so $w^T (x_a - x_b) = 0$
- Hence $w$ is perpendicular to the decision surface
- For any $x$ on the surface, $\frac{w^T x}{\|w\|} = -\frac{w_0}{\|w\|}$, so $w_0$ sets the distance of the surface from the origin
- Define $\tilde{w} = (w_0, w)$ and $\tilde{x} = (1, x)$ so that $y(x) = \tilde{w}^T \tilde{x}$
Linear discriminant function
[Figure: geometry of the linear discriminant in 2-D ($x_1$, $x_2$): the surface $y = 0$ separates region $R_1$ ($y > 0$) from $R_2$ ($y < 0$); $w$ is orthogonal to the surface, $y(x)/\|w\|$ is the signed distance of $x$ from it, and $-w_0/\|w\|$ its distance from the origin]
Multi-Class Discrimination
- Generation of multiple decision functions: $y_k(x) = w_k^T x + w_{k0}$
- Decision strategy: $j = \arg\max_{i \in 1..k} y_i(x)$
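The argmax strategy, sketched with made-up weights for three classes:

```python
import numpy as np

def classify_multi(x, W, w0):
    """W holds one weight vector w_k per row; pick the largest y_k(x)."""
    y = W @ x + w0              # all y_k(x) = w_k^T x + w_k0 at once
    return int(np.argmax(y))

W = np.array([[1.0, 0.0],       # three hypothetical classes in 2-D
              [0.0, 1.0],
              [-1.0, -1.0]])
w0 = np.zeros(3)
print(classify_multi(np.array([0.2, 0.9]), W, w0))  # prints 1
```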
Multi-Class Decision Regions
[Figure: decision regions $R_i$, $R_j$, $R_k$ with two points $x_A$, $x_B$ and a point $\hat{x}$ on the line between them, illustrating the regions of the argmax rule]
Example - Bolts or Needles
Minimum distance classification
- Suppose we have computed the mean value for each of the classes: $m_{needle} = [0.86, 2.34]^T$ and $m_{bolt} = [5.74, 5.85]^T$
- We can then compute the distance to each mean: $d_j(x) = \|x - m_j\|$
- $\arg\min_i d_i(x)$ is the best fit
- Linear decision functions can be derived from this rule
Bolts / Needle Decision Functions
- Needle: $d_{needle}(x) = 0.86 x_1 + 2.34 x_2 - 3.10$
- Bolt: $d_{bolt}(x) = 5.74 x_1 + 5.85 x_2 - 33.59$
- These follow from expanding $\|x - m_j\|^2$: minimizing the distance is equivalent to maximizing $m_j^T x - \frac{1}{2} m_j^T m_j$, which is linear in $x$
- Decision boundary: $d_i(x) - d_j(x) = 0$, here $d_{needle/bolt}(x) = -4.88 x_1 - 3.51 x_2 + 30.49$
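The numbers above can be verified directly; a small sketch, assuming NumPy and a hypothetical measurement x:

```python
import numpy as np

m_needle = np.array([0.86, 2.34])
m_bolt = np.array([5.74, 5.85])

def d(x, m):
    """Linear score equivalent to minimum distance: m^T x - 0.5 ||m||^2."""
    return m @ x - 0.5 * (m @ m)

x = np.array([1.0, 2.0])                     # hypothetical measurement
scores = {"needle": d(x, m_needle), "bolt": d(x, m_bolt)}
print(max(scores, key=scores.get))           # needle

# Boundary coefficients reproduce -4.88 x1 - 3.51 x2 + 30.49 (up to rounding):
print(m_needle - m_bolt, 0.5 * (m_bolt @ m_bolt - m_needle @ m_needle))
```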
Example decision surface
Least Squares for Classification
- Just as we used LSQ for regression, we can approximate the classification mapping by least squares
- Consider again $y_k(x) = w_k^T x + w_{k0}$
- Rewrite to $y(x) = \tilde{W}^T \tilde{x}$ with augmented vectors
- Assume we have a target matrix $T$ (one 1-of-K row per sample)
Least Squares for Classification
The error is then:
$E_D(\tilde{W}) = \frac{1}{2} \mathrm{Tr}\left\{ (\tilde{X}\tilde{W} - T)^T (\tilde{X}\tilde{W} - T) \right\}$
The solution is then:
$\tilde{W} = (\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T T$
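A sketch of the closed-form solution on synthetic two-class data (all data invented); the pseudo-inverse implements $(\tilde{X}^T \tilde{X})^{-1} \tilde{X}^T$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 0.5, (20, 2)),    # class 0 samples
               rng.normal([3, 3], 0.5, (20, 2))])   # class 1 samples
T = np.zeros((40, 2))
T[:20, 0] = 1.0                                     # 1-of-K targets
T[20:, 1] = 1.0

Xt = np.hstack([np.ones((40, 1)), X])               # augmented inputs (1, x)
W = np.linalg.pinv(Xt) @ T                          # W~ = (X~^T X~)^-1 X~^T T

pred = (Xt @ W).argmax(axis=1)                      # decide by largest output
labels = np.r_[np.zeros(20), np.ones(20)]
print("training accuracy:", (pred == labels).mean())
```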
LSQ and Outliers
[Figure: two scatter plots of the same two-class data, without and with a group of outliers; the least-squares decision boundary is pulled towards the outliers]
Fisher's linear discriminant
- Selection of a projection that maximizes the distance between classes
- Assume for a start: $y = w^T x$
- Compute the class means: $m_1 = \frac{1}{N_1} \sum_{i \in C_1} x_i$ and $m_2 = \frac{1}{N_2} \sum_{j \in C_2} x_j$
- Separation of the projected means: $\tilde{m}_2 - \tilde{m}_1 = w^T (m_2 - m_1)$, where $\tilde{m}_i = w^T m_i$
The suboptimal solution
[Figure: samples projected onto the line joining the class means; the two classes overlap considerably along this direction]
The Fisher criterion
Consider the expression
$J(w) = \frac{w^T S_B w}{w^T S_W w}$
where $S_B$ is the between-class covariance and $S_W$ is the within-class covariance, i.e.
$S_B = (m_2 - m_1)(m_2 - m_1)^T$
$S_W = \sum_{i \in C_1} (x_i - m_1)(x_i - m_1)^T + \sum_{i \in C_2} (x_i - m_2)(x_i - m_2)^T$
$J(w)$ is optimized when $(w^T S_B w) S_W w = (w^T S_W w) S_B w$, i.e. $w \propto S_W^{-1} (m_2 - m_1)$
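A sketch of the two-class Fisher direction on synthetic data (invented values); `solve` is used here as a numerical convenience instead of forming $S_W^{-1}$ explicitly:

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], 1.0, (50, 2))      # class 1 samples
X2 = rng.normal([4, 2], 1.0, (50, 2))      # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
SW = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
w = np.linalg.solve(SW, m2 - m1)           # w proportional to S_W^-1 (m2 - m1)
w /= np.linalg.norm(w)

# Projected class means should be well separated along w:
print("projected means:", m1 @ w, m2 @ w)
```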
The Fisher result
[Figure: samples projected onto the Fisher direction; the two classes are now well separated]
Generalization to N > 2
- Define a stacked weight matrix: $y = W^T x$
- The within-class covariance generalizes to $S_W = \sum_{k=1}^{K} S_k$ with $S_k = \sum_{i \in C_k} (x_i - m_k)(x_i - m_k)^T$
- The between-class covariance is $S_B = \sum_{k=1}^{K} N_k (m_k - m)(m_k - m)^T$, where $m$ is the global mean
- It can be shown that $J(W)$ is optimized by the leading eigenvectors of $S_W^{-1} S_B$
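A sketch of the multi-class case on made-up 3-D data with K = 3, keeping the K - 1 = 2 leading eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(2)
centers = [np.array([0., 0., 0.]), np.array([4., 0., 0.]), np.array([0., 4., 0.])]
Xs = [rng.normal(c, 1.0, (30, 3)) for c in centers]   # K = 3 classes

m = np.vstack(Xs).mean(axis=0)                        # global mean
SW = sum((X - X.mean(0)).T @ (X - X.mean(0)) for X in Xs)
SB = sum(len(X) * np.outer(X.mean(0) - m, X.mean(0) - m) for X in Xs)

vals, vecs = np.linalg.eig(np.linalg.solve(SW, SB))   # eigenproblem S_W^-1 S_B
order = np.argsort(vals.real)[::-1]
W = vecs.real[:, order[:2]]                           # 2 discriminant directions
print("eigenvalues:", vals.real[order])               # at most K-1 are non-zero
```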
Perceptron Algorithm
- Developed by Rosenblatt (1962)
- Formed an important basis for neural networks
- Uses a non-linear transformation $\phi(x)$
- Construct a decision function $y(x) = f(w^T \phi(x))$ where
$f(a) = \begin{cases} +1, & a \geq 0 \\ -1, & a < 0 \end{cases}$
The perceptron criterion
- Normally we want $w^T \phi(x_n)\, t_n > 0$ for all samples
- Given the target definition above, the error is $E_P(w) = -\sum_{n \in M} w^T \phi_n t_n$, where $M$ is the set of mis-classified samples
- We can minimize this by gradient descent, as seen in the last lecture: $w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_P(w) = w^{(\tau)} + \eta \phi_n t_n$
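A sketch of the update loop on a tiny separable data set (values invented); the bias is absorbed into an augmented feature $\phi(x) = (1, x)$:

```python
import numpy as np

def perceptron(Phi, t, eta=1.0, epochs=100):
    """Phi: (N, D) features phi(x_n); t: (N,) targets in {-1, +1}."""
    w = np.zeros(Phi.shape[1])
    for _ in range(epochs):
        errors = 0
        for phi_n, t_n in zip(Phi, t):
            if t_n * (w @ phi_n) <= 0:     # mis-classified sample
                w += eta * t_n * phi_n     # w <- w + eta * phi_n * t_n
                errors += 1
        if errors == 0:                    # converged (data are separable)
            break
    return w

Phi = np.array([[1,  2.,  2.], [1,  1.,  3.],    # class +1
                [1, -1., -2.], [1, -2., -1.]])   # class -1
t = np.array([1, 1, -1, -1])
print(perceptron(Phi, t))
```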
Perceptron learning example
[Figure: four snapshots of a 2-D perceptron run; at each step the boundary is adjusted using one mis-classified point until all samples are correctly classified]
Summary
- Basics of discrimination / classification
- Obviously not all problems are linear
- Optimization of the distance/overlap between classes
- Minimizing the probability of classification error
- Basic formulation as an optimization problem
- How to optimize the between-cluster distance? Covariance weighted (Fisher)
- Basic recursive formulation (perceptron): could we make it more robust?