Linear Discriminant Functions


Linear discriminant functions and decision surfaces. Definition: a linear discriminant function is a linear combination of the components of x,

g(x) = w^T x + w_0,    (1)

where w is the weight vector and w_0 is the bias. A two-category classifier with a discriminant function of the form (1) uses the following rule: decide ω_1 if g(x) > 0 and ω_2 if g(x) < 0; equivalently, decide ω_1 if w^T x > -w_0 and ω_2 otherwise. If g(x) = 0, x can be assigned to either class.

The equation g(x) = 0 defines the decision surface that separates points assigned to category ω_1 from points assigned to category ω_2. When g(x) is linear, this decision surface is a hyperplane. There is a simple algebraic expression for the distance from x to the hyperplane (an interesting result, derived next).


Write x = x_p + r (w / ||w||), where x_p is the projection of x onto the hyperplane H and r is the signed distance from x to H. Since g(x_p) = 0 and w^T w = ||w||^2,

g(x) = w^T x + w_0 = w^T (x_p + r w/||w||) + w_0 = g(x_p) + r ||w|| = r ||w||,

so r = g(x) / ||w||. In particular, the distance from the origin to H is d(0, H) = w_0 / ||w||.

In conclusion, a linear discriminant function divides the feature space by a hyperplane decision surface. The orientation of the surface is determined by the normal vector w, and its location is determined by the bias w_0.
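A minimal NumPy sketch of the two-category rule and the distance result above; the weight vector, bias, and sample values are illustrative, not taken from the lecture.

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^T x + w0."""
    return np.dot(w, x) + w0

def classify(x, w, w0):
    """Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0; g(x) = 0 is ambiguous."""
    value = g(x, w, w0)
    return 1 if value > 0 else (2 if value < 0 else None)

def signed_distance(x, w, w0):
    """Algebraic distance from x to the hyperplane H: r = g(x) / ||w||."""
    return g(x, w, w0) / np.linalg.norm(w)

# Example with arbitrary (illustrative) parameters:
w, w0 = np.array([2.0, -1.0]), 0.5
x = np.array([1.0, 3.0])
print(classify(x, w, w0), signed_distance(x, w, w0))
```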

The multi-category case. We define c linear discriminant functions g_i(x) = w_i^T x + w_i0, i = 1, ..., c, and assign x to ω_i if g_i(x) > g_j(x) for all j ≠ i; in case of ties, the classification is undefined. A classifier of this form is called a linear machine. A linear machine divides the feature space into c decision regions, with g_i(x) being the largest discriminant when x is in region R_i. For two contiguous regions R_i and R_j, the boundary that separates them is a portion of the hyperplane H_ij defined by g_i(x) = g_j(x), i.e.

(w_i - w_j)^T x + (w_i0 - w_j0) = 0.

Thus w_i - w_j is normal to H_ij, and the distance from x to H_ij is d(x, H_ij) = (g_i(x) - g_j(x)) / ||w_i - w_j||.
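A sketch of a linear machine over c classes, assuming illustrative weight rows W and biases w0; ties are reported as undefined, as in the rule above.

```python
import numpy as np

def linear_machine(x, W, w0):
    """Multi-category linear machine.

    W  : (c, d) array, row i is the weight vector w_i
    w0 : (c,) array of biases w_i0
    Assigns x to the class with the largest g_i(x) = w_i^T x + w_i0.
    """
    scores = W @ x + w0                             # g_1(x), ..., g_c(x)
    best = np.flatnonzero(scores == scores.max())
    return best[0] if len(best) == 1 else None      # ties -> undefined

# Illustrative parameters for c = 3 classes in d = 2 dimensions:
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])
print(linear_machine(np.array([2.0, 1.0]), W, w0))   # prints 0
```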


Generalized Linear Discriminant Functions. The decision boundaries that separate classes may not always be linear; the complexity of the boundaries may sometimes call for highly non-linear surfaces. A popular approach to generalizing the concept of a linear decision function is to consider a generalized decision function of the form

g(x) = w_1 f_1(x) + w_2 f_2(x) + ... + w_N f_N(x) + w_{N+1},    (2)

where the f_i(x), 1 ≤ i ≤ N, are scalar functions of the pattern x, x ∈ R^n (Euclidean space).

Introducing f_{N+1}(x) = 1, we get

g(x) = Σ_{i=1}^{N+1} w_i f_i(x) = w^T y,

where w = (w_1, w_2, ..., w_N, w_{N+1})^T and y = (f_1(x), f_2(x), ..., f_N(x), f_{N+1}(x))^T. This representation implies that any decision function defined by equation (2) can be treated as linear in the (N+1)-dimensional space (N+1 > n), while g(x) keeps its non-linear character in R^n.

The most commonly used generalized decision functions are those for which the f_i(x), 1 ≤ i ≤ N, are polynomials:

g(x) = w_0 + Σ_i w_i x_i + Σ_i Σ_j w_ij x_i x_j + Σ_i Σ_j Σ_k w_ijk x_i x_j x_k + ...

Quadratic decision function for a 2-dimensional feature space:

g(x) = w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2 + w_4 x_1 + w_5 x_2 + w_6,

where w = (w_1, w_2, ..., w_6)^T and y = (x_1^2, x_1 x_2, x_2^2, x_1, x_2, 1)^T.
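A small sketch of the quadratic case above: the 2-D pattern x is mapped to y = (x_1^2, x_1 x_2, x_2^2, x_1, x_2, 1)^T so that g(x) becomes a linear function of y. The weight values are arbitrary and only for illustration.

```python
import numpy as np

def quadratic_features(x):
    """Map a 2-D pattern x = (x1, x2) to y = (x1^2, x1*x2, x2^2, x1, x2, 1)^T,
    so the quadratic g(x) becomes a linear function w^T y in 6 dimensions."""
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2, x1, x2, 1.0])

w = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # weights w_1..w_6 (illustrative)
x = np.array([0.5, -1.0])
print(w @ quadratic_features(x))                # value of g(x)
```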

For patterns x ∈ R^n, the most general quadratic decision function is

g(x) = Σ_{i=1}^{n} w_ii x_i^2 + Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} w_ij x_i x_j + Σ_{i=1}^{n} w_i x_i + w_{n+1}.    (3)

The number of terms on the right-hand side is

l = N + 1 = n + n(n-1)/2 + n + 1 = (n+1)(n+2)/2.

This is the total number of weights, i.e. the free parameters of the problem. For n = 3 the weight vector is 10-dimensional; for n = 10 it is 66-dimensional.

In the case of polynomial decision functions of order m, a typical f_i(x) is given by

f_i(x) = x_{i1}^{e1} x_{i2}^{e2} ... x_{im}^{em},

where 1 ≤ i1, i2, ..., im ≤ n and each exponent e_k, 1 ≤ k ≤ m, is 0 or 1; this is a monomial of degree between 0 and m. To avoid repetitions we require i1 ≤ i2 ≤ ... ≤ im, and write

g^m(x) = Σ_{i1=1}^{n} Σ_{i2=i1}^{n} ... Σ_{im=i_{m-1}}^{n} w_{i1 i2 ... im} x_{i1} x_{i2} ... x_{im} + g^{m-1}(x),

where g^{m-1}(x) is the most general polynomial decision function of order m - 1.

Example 1: Let n = 3 and m = 2. Then

g^2(x) = Σ_{i1=1}^{3} Σ_{i2=i1}^{3} w_{i1 i2} x_{i1} x_{i2} + g^1(x)
       = w_11 x_1^2 + w_12 x_1 x_2 + w_13 x_1 x_3 + w_22 x_2^2 + w_23 x_2 x_3 + w_33 x_3^2 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4.

Example 2: Let n = 2 and m = 3. Then

g^3(x) = Σ_{i1=1}^{2} Σ_{i2=i1}^{2} Σ_{i3=i2}^{2} w_{i1 i2 i3} x_{i1} x_{i2} x_{i3} + g^2(x)
       = w_111 x_1^3 + w_112 x_1^2 x_2 + w_122 x_1 x_2^2 + w_222 x_2^3 + g^2(x),

where g^2(x) = Σ_{i1=1}^{2} Σ_{i2=i1}^{2} w_{i1 i2} x_{i1} x_{i2} + g^1(x) = w_11 x_1^2 + w_12 x_1 x_2 + w_22 x_2^2 + w_1 x_1 + w_2 x_2 + w_3.
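A short sketch that enumerates the monomials of degree 0..m with non-decreasing indices, matching the general polynomial decision function above; it also confirms the weight count l = (n+1)(n+2)/2 for the quadratic case (10 for n = 3, 66 for n = 10). The helper name monomial_features is ours, not from the lecture.

```python
from itertools import combinations_with_replacement
from math import prod

def monomial_features(x, m):
    """All monomials x_{i1} * ... * x_{ik} of degree 0..m with i1 <= ... <= ik
    (degree 0 yields the constant term 1)."""
    n = len(x)
    feats = []
    for degree in range(m + 1):
        for idx in combinations_with_replacement(range(n), degree):
            feats.append(prod(x[i] for i in idx))   # empty product = 1
    return feats

x = [2.0, 3.0, 5.0]
print(len(monomial_features(x, 2)))            # n = 3,  m = 2  ->  10
print(len(monomial_features([1.0] * 10, 2)))   # n = 10, m = 2  ->  66
```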


Three ideas used in what follows: augmentation (append a constant component 1 to each sample so that the bias folds into the weight vector), incorporating the class labels into the data (y_i = α_i x_i, defined below), and the margin.


Learning Linear Classifiers

Basic ideas: directly fit a linear boundary in the feature space.
1) Define an error function for classification: the number of errors? the total distance of mislabeled points from the boundary? least-squares error? within/between-class scatter?
2) Optimize the error function on the training data: gradient descent, modified gradient descent, various search procedures.
3) How much data is needed?

Two-category case. Given n sample points x_1, x_2, ..., x_n with true category labels α_1, α_2, ..., α_n:

α_i = +1 if point i is from class ω_1, α_i = -1 if point i is from class ω_2.

Decisions are made according to: if a^T x_i > 0, class ω_1 is chosen; if a^T x_i < 0, class ω_2 is chosen. Now these decisions are wrong when a^T x_i is negative and x_i belongs to class ω_1 (or positive and x_i belongs to class ω_2). Let y_i = α_i x_i. Then a^T y_i > 0 when x_i is correctly labelled, and negative otherwise.
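A minimal sketch of the label-folding step y_i = α_i x_i, assuming the samples are already augmented with a leading 1; the data and the candidate weight vector a are illustrative.

```python
import numpy as np

def fold_labels(X, alpha):
    """Given augmented samples X (rows x_i, first component 1) and labels
    alpha_i in {+1, -1}, return y_i = alpha_i * x_i so that a correct
    classification is simply a^T y_i > 0 for every i."""
    return X * alpha[:, None]

# Illustrative data: two samples per class, augmented with a leading 1.
X = np.array([[1.0,  2.0,  1.0],
              [1.0,  1.5,  2.0],
              [1.0, -1.0, -0.5],
              [1.0, -2.0, -1.0]])
alpha = np.array([1, 1, -1, -1])
Y = fold_labels(X, alpha)
a = np.array([0.0, 1.0, 1.0])   # some candidate weight vector
print(Y @ a > 0)                # True where a classifies correctly
```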

Separating vector (solution vector). Find a such that a^T y_i > 0 for all i. The equation a^T y = 0 defines a hyperplane with a as its normal.

Problem: the solution is not unique. Add an additional constraint, a margin b on the distance away from the boundary: require a^T y_i > b.

Margins in data space (figure): larger margins promote uniqueness for under-constrained problems.

Finding a solution vector by minimizing an error criterion. What error function? How do we weight the points: all of the points, or only the error points? Using only the error points gives the Perceptron criterion

J_P(a) = Σ_{y_i ∈ Y} -(a^T y_i),  Y = {y_i : a^T y_i < 0},

and the Perceptron criterion with margin

J_P(a) = Σ_{y_i ∈ Y} -(a^T y_i),  Y = {y_i : a^T y_i < b}.
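A sketch of the Perceptron criterion as written above (strict inequality, with an optional margin b); the folded samples Y and the vector a are illustrative.

```python
import numpy as np

def perceptron_criterion(a, Y, b=0.0):
    """Perceptron criterion J_P(a) = sum over Y_bad of -(a^T y_i),
    with Y_bad = {y_i : a^T y_i < b}. b = 0 gives the plain criterion;
    b > 0 also penalizes points that fall inside the margin."""
    scores = Y @ a
    bad = scores < b
    return -scores[bad].sum(), bad

# Illustrative label-folded samples y_i (rows) and a candidate a:
Y = np.array([[1.0, 2.0], [0.3, 0.2], [-1.0, 0.3], [-1.0, -0.4]])
a = np.array([1.0, 1.0])
print(perceptron_criterion(a, Y))          # only the misclassified points count
print(perceptron_criterion(a, Y, b=1.0))   # margin also pulls in points with 0 < a^T y_i < 1
```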


Minimizing Error via Gradient Descent

The min(max) problem: min_x f(x). But we learned in calculus how to solve that kind of question!

Motivation. Not exactly. For functions with an analytic form, such as high-order polynomials like

f : R → R,  f(x) = x - x^3/6 + x^5/120 - x^7/5040,

calculus may suffice. But what about a function that doesn't have an analytic representation: a black box?

[Figure: an example two-dimensional surface f(x, y) built from a product of cosines.]

Directional derivatives: first, the one-dimensional derivative.

Directional derivatives along the axes: ∂f(x, y)/∂x and ∂f(x, y)/∂y.

Directional derivative in a general direction v ∈ R^2, ||v|| = 1: ∂f(x, y)/∂v.

Directional derivatives (figure): ∂f(x, y)/∂x and ∂f(x, y)/∂y as special cases.

The gradient: definition in the plane. For f : R^2 → R,

∇f(x, y) := (∂f/∂x, ∂f/∂y)^T.

The gradient: definition. For f : R^n → R,

∇f(x_1, ..., x_n) := (∂f/∂x_1, ..., ∂f/∂x_n)^T.

The gradient: properties. The gradient defines the (hyper)plane approximating the function infinitesimally:

Δz = (∂f/∂x) Δx + (∂f/∂y) Δy.

The gradient: properties. Computing directional derivatives: we want the rate of change along a direction v, ||v|| = 1, at a point p. By the chain rule (important for later use):

∂f/∂v (p) = ⟨(∇f)_p, v⟩.
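A quick numerical check of the chain-rule identity ∂f/∂v(p) = ⟨(∇f)_p, v⟩, using a small example function whose gradient is known; the function, point, and direction are arbitrary choices for illustration.

```python
import numpy as np

def numeric_directional_derivative(f, p, v, h=1e-6):
    """Central-difference estimate of d/dt f(p + t v) at t = 0."""
    return (f(p + h * v) - f(p - h * v)) / (2 * h)

# Example function with a known gradient (illustrative):
f = lambda x: x[0]**2 + 3.0 * x[0] * x[1]
grad_f = lambda x: np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0]])

p = np.array([1.0, -2.0])
v = np.array([0.8, 0.6])                        # unit vector
print(numeric_directional_derivative(f, p, v))  # ~ <grad f(p), v>
print(grad_f(p) @ v)
```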

The gradient: properties. ∂f/∂v (p) is maximal when choosing v = (∇f)_p / ||(∇f)_p|| and minimal when choosing v = -(∇f)_p / ||(∇f)_p|| (intuition: the gradient points in the direction of greatest change).

The gradient: properties. Let f : R^n → R be a smooth function around p. If f has a local minimum (maximum) at p, then (∇f)_p = 0 (a necessary condition for a local min (max)).

The gradient: properties. Intuition (figure).

The gradient: properties. Formally: for any v ∈ R^n \ {0} we get

0 = d/dt f(p + t v) |_{t=0} = ⟨(∇f)_p, v⟩  ⟹  (∇f)_p = 0.

The gradient: properties. We found the best infinitesimal direction at each point; looking for the minimum is a "blind man" procedure. How can we get to the minimum using this knowledge?

Steepest descent algorithm.
Initialization: x_0 ∈ R^n.
Loop while ||∇f(x_i)|| > ε:
  compute the search direction h_i = ∇f(x_i);
  update x_{i+1} = x_i - λ h_i.
End loop.
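A minimal sketch of the steepest-descent loop above; the step size λ, the stopping tolerance ε, and the quadratic test function are illustrative choices.

```python
import numpy as np

def steepest_descent(grad_f, x0, lam=0.1, eps=1e-6, max_iter=10000):
    """Repeat x_{i+1} = x_i - lam * grad_f(x_i) until ||grad_f(x_i)|| <= eps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        h = grad_f(x)                    # search direction = gradient
        if np.linalg.norm(h) <= eps:
            break
        x = x - lam * h
    return x

# Example: f(x) = (x1 - 1)^2 + 2*(x2 + 3)^2, minimum at (1, -3).
grad_f = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)])
print(steepest_descent(grad_f, [0.0, 0.0]))
```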

Setting the learning rate intelligently: choose λ by minimizing a local approximation of f along the search direction (a line search).

Minimizing the Perceptron criterion

J_P(a) = Σ_{y_i ∈ Y} -(a^T y_i),  Y = {y_i : a^T y_i < 0}.

Its gradient is

∇_a J_P(a) = Σ_{y_i ∈ Y} ∇_a(-(a^T y_i)) = Σ_{y_i ∈ Y} -y_i,

which gives the update rule

a(k+1) = a(k) + η_k Σ_{y_i ∈ Y_k} y_i,

where Y_k is the misclassified set at step k.

Batch Perceptron Update
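A minimal sketch of the batch update a(k+1) = a(k) + η_k Σ_{y_i ∈ Y_k} y_i derived above, assuming linearly separable, label-folded, augmented samples; misclassification is tested with ≤ 0 so that the all-zero starting vector gets updated. The toy data and step size are illustrative.

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Batch perceptron: add eta times the sum of currently misclassified y_i.
    Y holds the label-folded (and augmented) samples y_i as rows; the data
    must be linearly separable for the loop to terminate early."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        mis = Y[Y @ a <= 0]            # Y_k: currently misclassified samples
        if len(mis) == 0:
            break                      # a^T y_i > 0 for all i: done
        a = a + eta * mis.sum(axis=0)
    return a

# Separable toy data (already label-folded, with a leading bias component):
Y = np.array([[ 1.0, 2.0, 1.0], [ 1.0, 1.0, 2.0],
              [-1.0, 1.0, 0.5], [-1.0, 2.0, 1.0]])
print(batch_perceptron(Y))
```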


Simplest case (figure).

When does the perceptron rule converge?


Different error functions. Four learning criteria as a function of the weights in a linear classifier. At the upper left is the total number of patterns misclassified, which is piecewise constant and hence unacceptable for gradient descent procedures. At the upper right is the Perceptron criterion, which is piecewise linear and acceptable for gradient descent. At the lower left is the squared error, which has nice analytic properties and is useful even when the patterns are not linearly separable. At the lower right is the squared error with margin. A designer may adjust the margin b in order to force the solution vector to lie toward the middle of the b = 0 solution region, in the hope of improving the generalization of the resulting classifier.
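A rough sketch of the four criteria described above as functions of a, useful for comparing their shapes on toy data. These are simple variants written from the descriptions here (for example, the squared-error-with-margin form below omits any normalization), and the data are illustrative.

```python
import numpy as np

def num_misclassified(a, Y):
    """Piecewise-constant count of samples with a^T y_i <= 0 (no useful gradient)."""
    return int(np.sum(Y @ a <= 0))

def perceptron(a, Y):
    """Piecewise-linear Perceptron criterion."""
    s = Y @ a
    return -s[s <= 0].sum()

def squared_error(a, Y):
    """Squared error over the misclassified samples; smooth at the boundary."""
    s = Y @ a
    return np.sum(s[s <= 0] ** 2)

def squared_error_with_margin(a, Y, b=1.0):
    """Squared error with margin b, penalizing samples with a^T y_i <= b."""
    s = Y @ a
    return np.sum((s[s <= b] - b) ** 2)

Y = np.array([[1.0, 2.0], [0.3, 0.2], [-1.0, 0.3]])
a = np.array([1.0, 1.0])
print(num_misclassified(a, Y), perceptron(a, Y),
      squared_error(a, Y), squared_error_with_margin(a, Y))
```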

Three different issues: 1) class boundary expressiveness, 2) error definition, 3) learning method.

Fisher's linear discriminant.

Fisher's linear discriminant: intuition (figure).

We are looking for a good projection y = w^T x: one that separates the projected class means while keeping the projected scatter of each class small. Writing this in terms of w gives the Fisher criterion

J(w) = (w^T m_1 - w^T m_2)^2 / (s_1^2 + s_2^2),

where m_1, m_2 are the class means and s_1^2, s_2^2 are the scatters of the projected samples. Next, we express the denominator in terms of w.

Define the scatter matrix of class i,

S_i = Σ_{x ∈ D_i} (x - m_i)(x - m_i)^T,

and the within-class scatter matrix S_W = S_1 + S_2. Then s_i^2 = w^T S_i w, and thus the denominator can be written

s_1^2 + s_2^2 = w^T S_W w.

Rayleigh quotient. With the between-class scatter S_B = (m_1 - m_2)(m_1 - m_2)^T, the criterion becomes the generalized Rayleigh quotient J(w) = (w^T S_B w) / (w^T S_W w). Maximizing it is equivalent to solving the generalized eigenvalue problem S_B w = λ S_W w, with solution

w ∝ S_W^{-1} (m_1 - m_2).
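A minimal sketch of Fisher's solution w ∝ S_W^{-1}(m_1 - m_2); the two sample clouds are synthetic, illustrative data.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction w ~ S_W^{-1} (m1 - m2),
    where S_W = S_1 + S_2 is the within-class scatter matrix."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S1 = (X1 - m1).T @ (X1 - m1)
    S2 = (X2 - m2).T @ (X2 - m2)
    Sw = S1 + S2
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)

# Illustrative data: two Gaussian-like clouds in 2-D.
rng = np.random.default_rng(0)
X1 = rng.normal([2.0, 2.0], 0.5, size=(50, 2))
X2 = rng.normal([0.0, 0.0], 0.5, size=(50, 2))
print(fisher_direction(X1, X2))
```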
