Linear Discriminant Functions
Linear discriminant functions and decision surfaces

Definition: a linear discriminant function is a linear combination of the components of x:
g(x) = w^T x + w_0
where w is the weight vector and w_0 the bias.

A two-category classifier with a discriminant function of this form uses the following rule: decide ω_1 if g(x) > 0 and ω_2 if g(x) < 0, i.e. decide ω_1 if w^T x > -w_0 and ω_2 otherwise. If g(x) = 0, x is assigned to either class.
The equation g(x) = 0 defines the decision surface that separates points assigned to the category ω_1 from points assigned to the category ω_2. When g(x) is linear, the decision surface is a hyperplane.

Algebraic measure of the distance from x to the hyperplane (interesting result!):
x = x_p + r \frac{w}{\|w\|}, and since g(x_p) = 0 and w^T w = \|w\|^2:
g(x) = w^T x + w_0 = w^T \left( x_p + r \frac{w}{\|w\|} \right) + w_0 = g(x_p) + r \|w\| = r \|w\|
so r = \frac{g(x)}{\|w\|}. In particular, d([0,0], H) = \frac{w_0}{\|w\|}.

In conclusion, a linear discriminant function divides the feature space by a hyperplane decision surface. The orientation of the surface is determined by the normal vector w, and the location of the surface is determined by the bias w_0.
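To make the distance result concrete, here is a minimal NumPy sketch (NumPy and all the numeric values are assumptions for illustration, not part of the original slides) that evaluates g(x) and the signed distance r = g(x)/\|w\|:

```python
import numpy as np

# Hypothetical 2-D hyperplane g(x) = w^T x + w0 = 0 (illustration values).
w = np.array([3.0, 4.0])   # normal vector
w0 = -5.0                  # bias

def signed_distance(x, w, w0):
    """Signed distance r = g(x) / ||w|| from x to the hyperplane g(x) = 0."""
    g = w @ x + w0
    return g / np.linalg.norm(w)

x = np.array([2.0, 1.0])
print(signed_distance(x, w, w0))            # positive => x lies on the omega_1 side
print(signed_distance(np.zeros(2), w, w0))  # distance of the origin: w0 / ||w||
```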
The multi-category case

We define c linear discriminant functions g_i(x) = w_i^T x + w_{i0}, i = 1, ..., c, and assign x to ω_i if g_i(x) > g_j(x) for all j ≠ i; in case of ties, the classification is undefined. In this case, the classifier is a linear machine.

A linear machine divides the feature space into c decision regions, with g_i(x) being the largest discriminant if x is in the region R_i.

For two contiguous regions R_i and R_j, the boundary that separates them is a portion of the hyperplane H_ij defined by:
g_i(x) = g_j(x), i.e. (w_i - w_j)^T x + (w_{i0} - w_{j0}) = 0
w_i - w_j is normal to H_ij, and
d(x, H_ij) = \frac{g_i(x) - g_j(x)}{\|w_i - w_j\|}
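As a sketch, the linear machine is just an argmax over the c discriminants; the weight rows and biases below are made-up illustration values, not from the slides:

```python
import numpy as np

# Hypothetical 3-class linear machine in R^2 (illustration values).
W = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])    # row i is the weight vector w_i
b = np.array([0.0, 0.5, -0.5])  # b[i] is the bias w_i0

def classify(x, W, b):
    """Assign x to the class with the largest discriminant g_i(x) = w_i^T x + w_i0."""
    g = W @ x + b
    return int(np.argmax(g))

print(classify(np.array([2.0, 0.5]), W, b))  # -> 0 for these values
```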
Generalized Linear Discriminant Functions

Decision boundaries which separate between classes may not always be linear. The complexity of the boundaries may sometimes require the use of highly non-linear surfaces.

A popular approach to generalize the concept of linear decision functions is to consider a generalized decision function:
g(x) = w_1 f_1(x) + w_2 f_2(x) + ... + w_N f_N(x) + w_{N+1}    (1)
where f_i(x), 1 ≤ i ≤ N, are scalar functions of the pattern x, x ∈ R^n (Euclidean space).
Introducing f_{N+1}(x) = 1 we get:
g(x) = \sum_{i=1}^{N+1} w_i f_i(x) = w^T y
where w = (w_1, w_2, ..., w_N, w_{N+1})^T and y = (f_1(x), f_2(x), ..., f_N(x), f_{N+1}(x))^T.

This latter representation of g(x) implies that any decision function defined by equation (1) can be treated as linear in the (N+1)-dimensional space (N+1 > n). g(x) maintains its non-linearity characteristics in R^n.
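A small sketch of this idea for n = 2 (the quadratic mapping and the circle-boundary weights are illustration choices, not from the slides): a circular boundary, non-linear in R^2, becomes a linear discriminant w^T y in the mapped space:

```python
import numpy as np

def phi(x):
    """Map x = (x1, x2) to y = (f_1(x), ..., f_N(x), 1) with quadratic terms."""
    x1, x2 = x
    return np.array([x1**2, x1*x2, x2**2, x1, x2, 1.0])

# g(x) = x1^2 + x2^2 - 1 (a circle boundary) is linear in the mapped space:
w = np.array([1.0, 0.0, 1.0, 0.0, 0.0, -1.0])

for x in [np.array([0.5, 0.5]), np.array([2.0, 0.0])]:
    print(x, w @ phi(x))   # negative inside the circle, positive outside
```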
The most commonly used generalized decision function is g(x) for which the f_i(x), 1 ≤ i ≤ N, are polynomials:
g(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} x_i x_j + \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} w_{ijk} x_i x_j x_k + ...

Quadratic decision function for a 2-dimensional feature space:
g(x) = w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2 + w_4 x_1 + w_5 x_2 + w_6
where w = (w_1, w_2, ..., w_6)^T and y = (x_1^2, x_1 x_2, x_2^2, x_1, x_2, 1)^T.
For patterns x ∈ R^n, the most general quadratic decision function is given by:
g(x) = \sum_{i=1}^{n} w_{ii} x_i^2 + \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} w_{ij} x_i x_j + \sum_{i=1}^{n} w_i x_i + w_{n+1}    (2)

The number of terms on the right-hand side is:
l = N + 1 = n + \frac{n(n-1)}{2} + n + 1 = \frac{(n+1)(n+2)}{2}
This is the total number of weights, which are the free parameters of the problem.
n = 3 ⇒ 10-dimensional; n = 10 ⇒ 66-dimensional.
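The count can be sanity-checked by a small sketch (the function name is made up; math.comb is standard Python):

```python
from math import comb

def n_quadratic_terms(n):
    """Free weights in the general quadratic discriminant over R^n:
    n squares + C(n, 2) cross terms + n linear terms + 1 bias."""
    return n + comb(n, 2) + n + 1

for n in (3, 10):
    print(n, n_quadratic_terms(n), (n + 1) * (n + 2) // 2)  # 3 -> 10, 10 -> 66
```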
In the case of polynomial decision functions of order m, a typical f_i(x) is given by:
f_i(x) = x_{i_1}^{e_1} x_{i_2}^{e_2} ... x_{i_m}^{e_m}
where 1 ≤ i_1, i_2, ..., i_m ≤ n and each e_i, 1 ≤ i ≤ m, is 0 or 1. It is a polynomial with a degree between 0 and m. To avoid repetitions, we require i_1 ≤ i_2 ≤ ... ≤ i_m:
g^m(x) = \sum_{i_1=1}^{n} \sum_{i_2=i_1}^{n} ... \sum_{i_m=i_{m-1}}^{n} w_{i_1 i_2 ... i_m} x_{i_1} x_{i_2} ... x_{i_m} + g^{m-1}(x)
where g^m(x) is the most general polynomial decision function of order m.
Example 1: Let n = 3 and m = 2; then:
g^2(x) = \sum_{i_1=1}^{3} \sum_{i_2=i_1}^{3} w_{i_1 i_2} x_{i_1} x_{i_2} + g^1(x)
= w_{11} x_1^2 + w_{12} x_1 x_2 + w_{13} x_1 x_3 + w_{22} x_2^2 + w_{23} x_2 x_3 + w_{33} x_3^2 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4

Example 2: Let n = 2 and m = 3; then:
g^3(x) = \sum_{i_1=1}^{2} \sum_{i_2=i_1}^{2} \sum_{i_3=i_2}^{2} w_{i_1 i_2 i_3} x_{i_1} x_{i_2} x_{i_3} + g^2(x)
= w_{111} x_1^3 + w_{112} x_1^2 x_2 + w_{122} x_1 x_2^2 + w_{222} x_2^3 + g^2(x)
where g^2(x) = \sum_{i_1=1}^{2} \sum_{i_2=i_1}^{2} w_{i_1 i_2} x_{i_1} x_{i_2} + g^1(x) and g^1(x) = w_1 x_1 + w_2 x_2 + w_3.
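In code, the ordering condition i_1 ≤ i_2 ≤ ... ≤ i_m is exactly combinations with replacement; a minimal sketch (the function name is hypothetical) that enumerates every monomial index tuple up to order m:

```python
from itertools import combinations_with_replacement

def monomials(n, m):
    """All index tuples (i1 <= i2 <= ... <= ik), k = 0..m, over x_1..x_n.
    The empty tuple stands for the constant term."""
    terms = []
    for k in range(m + 1):
        terms.extend(combinations_with_replacement(range(1, n + 1), k))
    return terms

# Example 1 above: n = 3, m = 2 has 10 terms; n = 2, m = 3 also has 10.
print(len(monomials(3, 2)), monomials(3, 2))
print(len(monomials(2, 3)))
```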
Augmentation. Incorporate the labels into the data. Margin.
Learning Linear Classifiers
Basic Ideas

Directly fit a linear boundary in feature space.
1) Define an error function for classification
   - Number of errors?
   - Total distance of mislabeled points from the boundary?
   - Least-squares error
   - Within/between class scatter
2) Optimize the error function on the training data
   - Gradient descent
   - Modified gradient descent
   - Various search procedures
3) How much data is needed?
Two-category case

Given sample points x_1, x_2, ..., x_n with true category labels α_1, α_2, ..., α_n:
α_i = +1 if point i is from class ω_1; α_i = -1 if point i is from class ω_2.

Decisions are made according to: if a^T x_i > 0, class ω_1 is chosen; if a^T x_i < 0, class ω_2 is chosen.

Now these decisions are wrong when a^T x_i is negative and x_i belongs to class ω_1 (or positive and x_i belongs to ω_2). Let y_i = α_i x_i. Then a^T y_i > 0 when x_i is correctly labeled, negative otherwise.
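A tiny sketch of this normalization trick with made-up samples: after multiplying each x_i by its label α_i, a single inequality a^T y_i > 0 tests all points at once:

```python
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5]])  # made-up samples
alpha = np.array([1, 1, -1])                          # class 1 -> +1, class 2 -> -1

Y = alpha[:, None] * X          # y_i = alpha_i * x_i
a = np.array([0.5, 1.0])        # a candidate weight vector

print(Y @ a)                    # all entries > 0  <=>  a separates the data
print(np.all(Y @ a > 0))
```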
Separating vector (solution vector)

Find a such that a^T y_i > 0 for all i. a^T y_i = 0 defines a hyperplane with a as normal.
Problem: the solution is not unique. Add additional constraints: a margin b, the distance away from the boundary: a^T y_i > b.
Margins in data space (margin b): larger margins promote uniqueness for underconstrained problems.
Finding a solution vector by minimizing an error criterion

What error function? How do we weight the points: all the points, or only the error points?

Only the error points: the Perceptron criterion
J_P(a) = \sum_{y_i \in Y} -(a^T y_i),  Y = {y_i | a^T y_i < 0}

Perceptron criterion with margin
J_P(a) = \sum_{y_i \in Y} -(a^T y_i),  Y = {y_i | a^T y_i < b}
Minimizing Error via Gradient Descent
The min (max) problem:
min_x f(x)
But we learned in calculus how to solve that kind of question!
Motivation

Not exactly. Functions such as high-order polynomials, e.g.
f: R → R,  f(x) = x - \frac{x^3}{6} + \frac{x^5}{120} - \frac{x^7}{5040}
can be handled analytically. What about functions that don't have an analytic presentation: a Black Box?
Example: a function f(x, y) defined as a product of cosines (shown as a surface plot).
Directional Derivatives: first, the one-dimensional derivative.
Directional Derivatives: Along the Axes
\frac{\partial f(x,y)}{\partial x},  \frac{\partial f(x,y)}{\partial y}
Directional Derivatives: In a general direction v ∈ R^2, \|v\| = 1:
\frac{\partial f(x,y)}{\partial v}
The Gradient: Definition in R^2
f: R^2 → R
\nabla f(x, y) := \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)^T (in the plane)
The Gradient: Definition
f: R^n → R
\nabla f(x_1, ..., x_n) := \left( \frac{\partial f}{\partial x_1}, ..., \frac{\partial f}{\partial x_n} \right)^T
The Gradient: Properties
The gradient defines a (hyper)plane approximating the function infinitesimally:
\Delta z = \frac{\partial f}{\partial x} \Delta x + \frac{\partial f}{\partial y} \Delta y
The Gradient: Properties
Computing directional derivatives: we want the rate of change along v, \|v\| = 1, at a point p. By the chain rule (important for later use):
\frac{\partial f}{\partial v}(p) = \langle (\nabla f)_p, v \rangle
The Gradient: Properties
\frac{\partial f}{\partial v}(p) is maximal when choosing v = \frac{(\nabla f)_p}{\|(\nabla f)_p\|}, and minimal when choosing v = -\frac{(\nabla f)_p}{\|(\nabla f)_p\|}.
(Intuition: the gradient points in the direction of greatest change.)
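A quick numerical check of this property on a made-up smooth function f(x, y) = x^2 + 3y^2: among many random unit vectors, no directional derivative exceeds the one along the normalized gradient:

```python
import numpy as np

def grad_f(p):
    """Gradient of the made-up function f(x, y) = x^2 + 3 y^2."""
    return np.array([2 * p[0], 6 * p[1]])

p = np.array([1.0, -2.0])
g = grad_f(p)

rng = np.random.default_rng(0)
best = -np.inf
for _ in range(1000):
    v = rng.normal(size=2)
    v /= np.linalg.norm(v)
    best = max(best, g @ v)      # directional derivative <grad f, v>

print(best, "<=", g @ (g / np.linalg.norm(g)))  # bounded by ||grad f||
```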
The Gradient: Properties
Let f: R^n → R be a smooth function around p. If f has a local minimum (maximum) at p, then (\nabla f)_p = 0.
(Necessary condition for a local min (max).)
The Gradient: Properties (Intuition)
The Gradient: Properties
Formally: for any v ∈ R^n \ {0} we get:
0 = \frac{d}{dt} f(p + t v) \Big|_{t=0} = \langle (\nabla f)_p, v \rangle  ⟹  (\nabla f)_p = 0
The Gradient: Properties
We found the best infinitesimal direction at each point. Looking for the minimum is like a blind man's procedure: how can we get to the minimum using this knowledge?
Steepest Descent Algorithm
Initialization: x_0 ∈ R^n
loop while \|\nabla f(x_i)\| > ε
    compute search direction h_i = \nabla f(x_i)
    update x_{i+1} = x_i - λ h_i
end loop
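A minimal NumPy sketch of this loop, run on the gradient of the made-up quadratic f(x) = x_1^2 + 3 x_2^2 from before (fixed step size λ; ε is the stopping tolerance; all values are illustrative):

```python
import numpy as np

def steepest_descent(grad, x0, lam=0.1, eps=1e-6, max_iter=10_000):
    """Iterate x_{i+1} = x_i - lam * grad(x_i) until ||grad(x_i)|| <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        h = grad(x)
        if np.linalg.norm(h) <= eps:
            break
        x = x - lam * h
    return x

grad = lambda x: np.array([2 * x[0], 6 * x[1]])   # grad of x1^2 + 3 x2^2
print(steepest_descent(grad, [1.0, -2.0]))        # converges toward (0, 0)
```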
Setting the Learning Rate Intelligently
Approximate f to second order around the current point x_k:
f(x) ≈ f(x_k) + \nabla f^T (x - x_k) + \frac{1}{2} (x - x_k)^T H (x - x_k)
where H is the Hessian at x_k. Minimizing this approximation along the descent direction gives the step size:
λ_k = \frac{\|\nabla f\|^2}{\nabla f^T H \nabla f}
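A tiny sketch of this closed-form step size, reusing the made-up quadratic from above (H is its Hessian, g a gradient value; both are illustration inputs):

```python
import numpy as np

H = np.array([[2.0, 0.0], [0.0, 6.0]])   # Hessian of f(x) = x1^2 + 3 x2^2
g = np.array([2.0, -12.0])               # gradient at the current point

eta = (g @ g) / (g @ H @ g)              # step minimizing the 2nd-order model
print(eta)
```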
Minimizing the Perceptron Criterion

Perceptron criterion:
J_P(a) = \sum_{y_i \in Y} -a^T y_i,  Y = {y_i | a^T y_i < 0}
\nabla_a J_P(a) = \sum_{y_i \in Y} -y_i

which gives the update rule:
a(k+1) = a(k) + η_k \sum_{y_i \in Y_k} y_i
where Y_k is the misclassified set at step k.
Batch Perceptron Update
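A runnable sketch of the batch update on normalized samples y_i (the data are made up and linearly separable; η is fixed at 1). Ties a^T y_i = 0 are counted as errors so the zero starting vector does not stall the loop:

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch update a <- a + eta * sum of misclassified y_i (a^T y_i <= 0)."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        miscl = Y[Y @ a <= 0]            # Y_k: currently misclassified samples
        if len(miscl) == 0:
            break                        # all a^T y_i > 0: separating vector found
        a = a + eta * miscl.sum(axis=0)
    return a

# Normalized, augmented samples y_i = alpha_i * (1, x_i) -- illustration values.
Y = np.array([[ 1.0, 1.0, 2.0],
              [ 1.0, 2.0, 0.5],
              [-1.0, 1.0, -1.5]])
a = batch_perceptron(Y)
print(a, Y @ a)                          # all components of Y @ a are positive
```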
Simplest Case
When does the perceptron rule converge? (If the training samples are linearly separable, the fixed-increment perceptron rule converges to a solution vector in a finite number of steps.)
Different Error Functions

Four learning criteria as a function of weights in a linear classifier. At the upper left is the total number of patterns misclassified, which is piecewise constant and hence unacceptable for gradient descent procedures. At the upper right is the Perceptron criterion (Eq. 6), which is piecewise linear and acceptable for gradient descent. The lower left is squared error (Eq. 3), which has nice analytic properties and is useful even when the patterns are not linearly separable. The lower right is the squared error with margin. A designer may adjust the margin b in order to force the solution vector to lie toward the middle of the b = 0 solution region in hopes of improving generalization of the resulting classifier.
Three Different Issues
1) Class boundary expressiveness
2) Error definition
3) Learning method
Fisher's Linear Discriminant
Fisher's Linear Discriminant: Intuition
We are looking for a good projection y = w^T x. Write this in terms of w: the projected class means are \tilde{m}_i = w^T m_i, so the separation of the projected means is |\tilde{m}_1 - \tilde{m}_2| = |w^T (m_1 - m_2)|. Next, we need to measure the spread of the projected samples within each class.
Define the scatter matrix:
S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^T
Then the within-class scatter matrix is S_W = S_1 + S_2, and the projected scatter is \tilde{s}_i^2 = w^T S_i w. Thus the denominator can be written:
\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w
Rayleigh Quotient

Maximizing the Rayleigh quotient
J(w) = \frac{w^T S_B w}{w^T S_W w}
where S_B = (m_1 - m_2)(m_1 - m_2)^T is the between-class scatter matrix, is equivalent to solving the generalized eigenvalue problem:
S_B w = λ S_W w
With solution:
w = S_W^{-1} (m_1 - m_2)
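A minimal NumPy sketch of the closed-form Fisher direction on made-up two-class data; solving S_W w = m_1 - m_2 avoids an explicit matrix inverse:

```python
import numpy as np

def fisher_direction(X1, X2):
    """w = S_W^{-1} (m1 - m2), the Fisher linear discriminant direction."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    Sw = S1 + S2                      # within-class scatter matrix
    return np.linalg.solve(Sw, m1 - m2)

rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], 0.5, size=(20, 2))   # made-up class 1 samples
X2 = rng.normal([2, 1], 0.5, size=(20, 2))   # made-up class 2 samples
w = fisher_direction(X1, X2)
print(w, (X1 @ w).mean(), (X2 @ w).mean())   # projected means separate
```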