Linear Discrimination Functions

Laurea Magistrale in Informatica. Nicola Fanizzi, Dipartimento di Informatica, Università degli Studi di Bari. November 4, 2009.

Outline
Linear models
Gradient descent
Perceptron
Minimum squared error approach
Linear and logistic regression

Linear Discriminant Functions I
A linear discriminant function can be written as
$g(\mathbf{x}) = w_1 x_1 + \dots + w_d x_d + w_0 = \mathbf{w}^t \mathbf{x} + w_0$
where $\mathbf{w}$ is the weight vector and $w_0$ is the bias or threshold.
A 2-class linear classifier implements the decision rule: decide $\omega_1$ if $g(\mathbf{x}) > 0$ and $\omega_2$ if $g(\mathbf{x}) < 0$.

Linear Discriminant Functions II
The equation $g(\mathbf{x}) = 0$ defines the decision surface that separates points assigned to $\omega_1$ from points assigned to $\omega_2$.
When $g(\mathbf{x})$ is linear, this decision surface is a hyperplane ($H$).

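Linear Discriminant Functions III
$H$ divides the feature space into 2 half-spaces: $\mathcal{R}_1$ for $\omega_1$ and $\mathcal{R}_2$ for $\omega_2$.
If $\mathbf{x}_1$ and $\mathbf{x}_2$ are both on the decision surface, then
$\mathbf{w}^t \mathbf{x}_1 + w_0 = \mathbf{w}^t \mathbf{x}_2 + w_0 \;\Rightarrow\; \mathbf{w}^t (\mathbf{x}_1 - \mathbf{x}_2) = 0$
so $\mathbf{w}$ is normal to any vector lying in the hyperplane.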

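Linear Discriminant Functions IV
Express $\mathbf{x}$ as
$\mathbf{x} = \mathbf{x}_p + r \, \frac{\mathbf{w}}{\|\mathbf{w}\|}$
where $\mathbf{x}_p$ is the normal projection of $\mathbf{x}$ onto $H$ and $r$ is the algebraic distance from $\mathbf{x}$ to the hyperplane.
Since $g(\mathbf{x}_p) = 0$, we have $g(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0 = r \|\mathbf{w}\|$, i.e. $r = g(\mathbf{x}) / \|\mathbf{w}\|$.
$r$ is a signed distance: $r > 0$ if $\mathbf{x}$ falls in $\mathcal{R}_1$, $r < 0$ if $\mathbf{x}$ falls in $\mathcal{R}_2$.
The distance from the origin to the hyperplane is $w_0 / \|\mathbf{w}\|$.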
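The signed distance $r = g(\mathbf{x}) / \|\mathbf{w}\|$ can be checked numerically. The sketch below is only an illustration; the function name `signed_distance` and the example hyperplane $3x_1 + 4x_2 - 5 = 0$ are assumptions, not part of the original slides.

```python
import math

def signed_distance(w, w0, x):
    """Signed distance r = g(x) / ||w|| from x to the hyperplane g(x) = 0."""
    g = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return g / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane: 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5
w, w0 = [3.0, 4.0], -5.0
r_point = signed_distance(w, w0, [3.0, 4.0])   # g = 20, positive side (R1)
r_origin = signed_distance(w, w0, [0.0, 0.0])  # equals w0 / ||w||
print(r_point, r_origin)
```

The sign of the result tells on which side of $H$ the point lies, and the value at the origin reproduces $w_0 / \|\mathbf{w}\|$.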

Linear Discriminant Functions V

Multi-category Case I
There are 2 approaches to extend the LDF approach to the multi-category case.
$\omega_i$ / not $\omega_i$: reduce the problem to $c - 1$ two-class problems, where problem #i is to find the function that separates points assigned to $\omega_i$ from those not assigned to $\omega_i$.
$\omega_i$ / $\omega_j$: find the $c(c - 1)/2$ linear discriminants, one for every pair of classes.
Both approaches can lead to regions in which the classification is undefined.

Multi-category Case II

Pairwise Classification
Idea: build a model for each pair of classes, using only training data from those classes.
Problem: this requires solving $c(c - 1)/2$ classification problems for $c$ classes.
This turns out not to be a problem in many cases, because the training sets become small.
Assume the data are evenly distributed, i.e. $2n/c$ instances per learning problem for $n$ instances in total.
Suppose the learning algorithm is linear in $n$.
Then the runtime of pairwise classification is proportional to $\frac{c(c - 1)}{2} \times \frac{2n}{c} = (c - 1)\,n$.

Linear Machine I
Define $c$ linear discriminant functions:
$g_i(\mathbf{x}) = \mathbf{w}_i^t \mathbf{x} + w_{i0}, \quad i = 1, \dots, c$
Linear machine classifier: assign $\mathbf{x}$ to $\omega_i$ if $g_i(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \neq i$.
In case of equal scores, the classification is undefined.
A linear machine divides the feature space into $c$ decision regions, with $g_i(\mathbf{x})$ the largest discriminant if $\mathbf{x}$ is in $\mathcal{R}_i$.
If $\mathcal{R}_i$ and $\mathcal{R}_j$ are contiguous, the boundary between them is a portion of the hyperplane $H_{ij}$ defined by
$g_i(\mathbf{x}) = g_j(\mathbf{x})$, or equivalently $(\mathbf{w}_i - \mathbf{w}_j)^t \mathbf{x} + (w_{i0} - w_{j0}) = 0$.

Linear Machine II
It follows that $\mathbf{w}_i - \mathbf{w}_j$ is normal to $H_{ij}$.
The signed distance from $\mathbf{x}$ to $H_{ij}$ is $\frac{g_i(\mathbf{x}) - g_j(\mathbf{x})}{\|\mathbf{w}_i - \mathbf{w}_j\|}$.
There are $c(c - 1)/2$ pairs of convex regions.
Not all regions are contiguous, and the total number of segments in the surfaces is often less than $c(c - 1)/2$.
[Figure: decision boundaries for 3- and 5-class problems]
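A linear machine amounts to taking the largest of $c$ linear scores. The following is a minimal sketch under that reading; the function name `linear_machine` and the three example discriminants are hypothetical, and a tie for the top score is reported as `None` to mirror the "undefined" case.

```python
def linear_machine(weights, biases, x):
    """Return the index i with the largest g_i(x), or None if the top score is tied."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    best = max(scores)
    winners = [i for i, s in enumerate(scores) if s == best]
    return winners[0] if len(winners) == 1 else None

# Three hypothetical discriminants in a 2-dimensional feature space
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
B = [0.0, 0.0, 0.0]

print(linear_machine(W, B, [2.0, 1.0]))    # g_0 is largest
print(linear_machine(W, B, [1.0, 1.0]))    # g_0 = g_1: undefined
print(linear_machine(W, B, [-1.0, -1.0]))  # g_2 is largest
```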

Generalized LDF I
The LDF is $g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i$.
Adding $d(d + 1)/2$ terms involving the products of pairs of components of $\mathbf{x}$ gives the quadratic discriminant function:
$g(\mathbf{x}) = w_0 + \sum_{i=1}^{d} w_i x_i + \sum_{i=1}^{d} \sum_{j=1}^{d} w_{ij} x_i x_j$
The separating surface defined by $g(\mathbf{x}) = 0$ is a second-degree or hyperquadric surface.
Adding more terms, such as $w_{ijk} x_i x_j x_k$, we obtain polynomial discriminant functions.

Generalized LDF II
The generalized LDF is defined as
$g(\mathbf{x}) = \sum_{i=1}^{\hat{d}} a_i y_i(\mathbf{x}) = \mathbf{a}^t \mathbf{y}$
where $\mathbf{a}$ is a $\hat{d}$-dimensional weight vector and the $y_i(\mathbf{x})$ are arbitrary functions of $\mathbf{x}$.
The resulting discriminant function is not linear in $\mathbf{x}$, but it is linear in $\mathbf{y}$.
The functions $y_i(\mathbf{x})$ map points in the $d$-dimensional $\mathbf{x}$-space to points in the $\hat{d}$-dimensional $\mathbf{y}$-space.

Generalized LDF III
Example: let the QDF be $g(x) = a_1 + a_2 x + a_3 x^2$.
The 3-dimensional vector is then $\mathbf{y} = [1, \; x, \; x^2]^t$.
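To make the mapping concrete, here is a small sketch of the feature map $x \mapsto (1, x, x^2)$. The names `phi` and `g` and the weight vector chosen (giving $g(x) = x^2 - 1$) are illustrative assumptions.

```python
def phi(x):
    """Map a scalar x to the 3-dimensional vector y = (1, x, x^2)."""
    return [1.0, x, x * x]

def g(a, x):
    """Generalized LDF: linear in y = phi(x), quadratic in x."""
    return sum(ai * yi for ai, yi in zip(a, phi(x)))

a = [-1.0, 0.0, 1.0]   # hypothetical weights: g(x) = x^2 - 1

for x in (-2.0, 0.0, 2.0):
    label = "omega_1" if g(a, x) > 0 else "omega_2"
    print(x, g(a, x), label)
```

With these weights the region assigned to $\omega_1$ is $|x| > 1$, which is not connected in $x$-space although the boundary is a single hyperplane in $\mathbf{y}$-space.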

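2-class Linearly-Separable Case I
$g(\mathbf{x}) = \sum_{i=0}^{d} w_i x_i = \mathbf{a}^t \mathbf{y}$
where $x_0 = 1$, $\mathbf{y}^t = [1 \; \mathbf{x}] = [1 \; x_1 \dots x_d]$ is an augmented feature vector and $\mathbf{a}^t = [w_0 \; \mathbf{w}] = [w_0 \; w_1 \dots w_d]$ is an augmented weight vector.
The hyperplane decision surface $H$ defined by $\mathbf{a}^t \mathbf{y} = 0$ passes through the origin in $\mathbf{y}$-space.
The distance from any point $\mathbf{y}$ to $H$ is given by $\frac{|\mathbf{a}^t \mathbf{y}|}{\|\mathbf{a}\|} = \frac{|g(\mathbf{x})|}{\|\mathbf{a}\|}$.
Because $\|\mathbf{a}\| = \sqrt{w_0^2 + \|\mathbf{w}\|^2} \geq \|\mathbf{w}\|$, this distance is at most the distance from $\mathbf{x}$ to $H$.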

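2-class Linearly-Separable Case II
Problem: find $[w_0 \; \mathbf{w}] = \mathbf{a}$.
Suppose that we have a set of $n$ examples $\{\mathbf{y}_1, \dots, \mathbf{y}_n\}$ labeled $\omega_1$ or $\omega_2$.
Look for a weight vector $\mathbf{a}$ that classifies all the examples correctly: $\mathbf{a}^t \mathbf{y}_i > 0$ when $\mathbf{y}_i$ is labeled $\omega_1$, and $\mathbf{a}^t \mathbf{y}_i < 0$ when $\mathbf{y}_i$ is labeled $\omega_2$.
If such an $\mathbf{a}$ exists, the examples are linearly separable.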

2-class Linearly-Separable Case III
Solutions: replacing all the examples labeled $\omega_2$ by their negatives, one can look for a weight vector $\mathbf{a}$ such that $\mathbf{a}^t \mathbf{y}_i > 0$ for all the examples.
Such an $\mathbf{a}$ is also known as a separating vector or solution vector.
Each example $\mathbf{y}_i$ places a constraint on the possible location of a solution vector.
$\mathbf{a}^t \mathbf{y}_i = 0$ defines a hyperplane through the origin having $\mathbf{y}_i$ as a normal vector.
The solution vector (if it exists) must be on the positive side of every hyperplane.
Solution region = intersection of the $n$ half-spaces.

2-class Linearly-Separable Case IV
Any vector that lies in the solution region is a solution vector: the solution vector (if it exists) is not unique.
Additional requirements can be imposed to find a solution vector closer to the middle of the region (i.e. more likely to classify new examples correctly).
Seek a unit-length weight vector that maximizes the minimum distance from the examples to the hyperplane.

2-class Linearly-Separable Case V
Seek the minimum-length weight vector satisfying $\mathbf{a}^t \mathbf{y}_i \geq b > 0$.
The solution region shrinks by the margin $b / \|\mathbf{y}_i\|$.

Gradient Descent I
Define a criterion function $J(\mathbf{a})$ that is minimized when $\mathbf{a}$ is a solution vector: $\mathbf{a}^t \mathbf{y}_i > 0$, $i = 1, \dots, n$.
Start with some arbitrary vector $\mathbf{a}(1)$.
Compute the gradient vector $\nabla J(\mathbf{a}(1))$.
The next value $\mathbf{a}(2)$ is obtained by moving some distance from $\mathbf{a}(1)$ in the direction of steepest descent, i.e. along the negative of the gradient.
In general, $\mathbf{a}(k + 1)$ is obtained from $\mathbf{a}(k)$ using
$\mathbf{a}(k + 1) \leftarrow \mathbf{a}(k) - \eta(k) \nabla J(\mathbf{a}(k))$
where $\eta(k)$ is the learning rate.

Gradient Descent II

Gradient Descent & Delta Rule I
To understand, consider a simpler linear machine (a.k.a. unit), where
$o = w_0 + w_1 x_1 + \dots + w_n x_n$
Let's learn the $w_i$'s that minimize the squared error, i.e. $J(\mathbf{w}) = E[\mathbf{w}]$ where
$E[\mathbf{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
$D$ is the set of training examples $\langle \mathbf{x}, t \rangle$ and $t$ is the target output value.

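Gradient Descent & Delta Rule II
Gradient:
$\nabla E[\mathbf{w}] \equiv \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n} \right]$
Training rule:
$\Delta \mathbf{w} = -\eta \nabla E[\mathbf{w}]$, i.e. $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$
Note that $\eta$ may be a constant.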

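Gradient Descent & Delta Rule III
$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \sum_{d} (t_d - o_d)^2$
$= \frac{1}{2} \sum_{d} \frac{\partial}{\partial w_i} (t_d - o_d)^2$
$= \frac{1}{2} \sum_{d} 2 (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - o_d)$
$= \sum_{d} (t_d - o_d) \frac{\partial}{\partial w_i} (t_d - \mathbf{w} \cdot \mathbf{x}_d)$
$\frac{\partial E}{\partial w_i} = \sum_{d} (t_d - o_d)(-x_{id})$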

Basic GRADIENT-DESCENT Algorithm
GRADIENT-DESCENT($D$, $\eta$)
$D$: training set, $\eta$: learning rate (e.g. 0.5)
Initialize each $w_i$ to some small random value
until the termination condition is met do
    Initialize each $\Delta w_i$ to zero
    for each $\langle \mathbf{x}, t \rangle \in D$ do
        Input the instance $\mathbf{x}$ to the unit and compute the output $o$
        for each $w_i$ do
            $\Delta w_i \leftarrow \Delta w_i + \eta (t - o) x_i$
    for each weight $w_i$ do
        $w_i \leftarrow w_i + \Delta w_i$
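The batch procedure above can be sketched in a few lines of Python. This is an illustration only: the function name, the fixed number of epochs used as termination condition, the zero initialization (instead of small random values, to keep the run reproducible), and the toy data generated by $t = 1 + 2x_1$ are all assumptions. A learning rate of 0.1 is used because, on this particular data, 0.5 would make the batch update diverge.

```python
def gradient_descent(data, eta, epochs):
    """Batch delta rule for a linear unit o = w . x, with x[0] = 1 as the bias input."""
    n = len(data[0][0])
    w = [0.0] * n                       # assumed: zeros instead of small random values
    for _ in range(epochs):             # assumed termination: fixed number of epochs
        delta = [0.0] * n               # accumulated weight changes for this pass
        for x, t in data:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n):
                delta[i] += eta * (t - o) * x[i]
        w = [wi + di for wi, di in zip(w, delta)]
    return w

# Hypothetical noise-free data from t = 1 + 2*x1, inputs augmented with x0 = 1
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
w = gradient_descent(data, eta=0.1, epochs=2000)
print(w)   # approaches [1, 2]
```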

Incremental (Stochastic) GRADIENT DESCENT I
An approximation of the standard GRADIENT-DESCENT.
Batch GRADIENT-DESCENT: do until satisfied
    1. Compute the gradient $\nabla E_D[\mathbf{w}]$
    2. $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla E_D[\mathbf{w}]$
Incremental GRADIENT-DESCENT: do until satisfied
    for each training example $d$ in $D$
        1. Compute the gradient $\nabla E_d[\mathbf{w}]$
        2. $\mathbf{w} \leftarrow \mathbf{w} - \eta \nabla E_d[\mathbf{w}]$

Incremental (Stochastic) GRADIENT DESCENT II
$E_D[\mathbf{w}] \equiv \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2$
$E_d[\mathbf{w}] \equiv \frac{1}{2} (t_d - o_d)^2$
Training rule (delta rule): $\Delta w_i \leftarrow \eta (t - o) x_i$
It is similar to the perceptron training rule, yet the output is unthresholded.
Convergence is only asymptotically guaranteed.
Linear separability is no longer needed!
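For comparison with the batch version, the incremental delta rule updates the weights after every example. The sketch below makes the same kind of assumptions as any toy run: a hypothetical function name, zero initial weights, a fixed number of passes, and noise-free data from $t = 1 + 2x_1$.

```python
def incremental_gradient_descent(data, eta, epochs):
    """Delta rule applied example by example: w <- w + eta * (t - o) * x."""
    w = [0.0] * len(data[0][0])         # assumed: zero initial weights
    for _ in range(epochs):             # assumed termination: fixed number of passes
        for x, t in data:
            o = sum(wi * xi for wi, xi in zip(w, x))   # unthresholded output
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical noise-free data from t = 1 + 2*x1, inputs augmented with x0 = 1
data = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0), ([1.0, 2.0], 5.0)]
w_inc = incremental_gradient_descent(data, eta=0.1, epochs=2000)
print(w_inc)   # approaches [1, 2]
```

Because this data can be fitted exactly, the incremental rule settles on the same weights as the batch rule; with noisy data and a constant $\eta$ it would keep fluctuating around the minimum.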

Standard vs. Stochastic GRADIENT-DESCENT
Incremental GD can approximate batch GD arbitrarily closely if $\eta$ is made small enough.
In standard GD the error is summed over all examples before the weights are updated; in stochastic GD the weights are updated upon each example.
Standard GD is more costly per update step and can employ a larger $\eta$.
Stochastic GD may avoid falling into local minima because it uses $\nabla E_d$ instead of $\nabla E_D$.

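Newton's Algorithm
Second-order expansion of the criterion around $\mathbf{a}(k)$:
$J(\mathbf{a}) \simeq J(\mathbf{a}(k)) + \nabla J^t (\mathbf{a} - \mathbf{a}(k)) + \frac{1}{2} (\mathbf{a} - \mathbf{a}(k))^t \mathbf{H} (\mathbf{a} - \mathbf{a}(k))$
where $\mathbf{H}$, with entries $\frac{\partial^2 J}{\partial a_i \partial a_j}$, is the Hessian matrix.
Choose $\mathbf{a}(k + 1)$ to minimize this function:
$\mathbf{a}(k + 1) \leftarrow \mathbf{a}(k) - \mathbf{H}^{-1} \nabla J$, with the gradient evaluated at $\mathbf{a}(k)$.
It gives a greater improvement per step than GD, but it is not applicable when $\mathbf{H}$ is singular.
Time complexity: $O(d^3)$.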
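A single Newton update can be illustrated in two dimensions, where the inverse Hessian has a closed form. Everything in this sketch is hypothetical: the function name `newton_step` and the quadratic criterion $J(\mathbf{a}) = (a_1 - 1)^2 + 2(a_2 + 3)^2$, whose gradient and Hessian are written out by hand.

```python
def newton_step(a, grad, H):
    """One Newton update a - H^{-1} grad for a 2x2 Hessian H."""
    (h11, h12), (h21, h22) = H
    det = h11 * h22 - h12 * h21
    if det == 0:
        raise ValueError("Hessian is singular: Newton's step is not applicable")
    # Components of H^{-1} grad, using the closed-form inverse of a 2x2 matrix
    d1 = (h22 * grad[0] - h12 * grad[1]) / det
    d2 = (-h21 * grad[0] + h11 * grad[1]) / det
    return [a[0] - d1, a[1] - d2]

# Hypothetical criterion J(a) = (a1 - 1)^2 + 2*(a2 + 3)^2, minimized at (1, -3)
a0 = [0.0, 0.0]
grad0 = [2.0 * (a0[0] - 1.0), 4.0 * (a0[1] + 3.0)]   # gradient at a0
H = [[2.0, 0.0], [0.0, 4.0]]                          # constant Hessian
a1 = newton_step(a0, grad0, H)
print(a1)
```

Since the criterion here is exactly quadratic, one step lands on the minimum; for a general $J$ the expansion is only approximate and the step is repeated.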

Perceptron I
Assumption: the data is linearly separable.
Hyperplane: $\sum_{i=0}^{d} w_i x_i = 0$, assuming that there is a constant attribute $x_0 = 1$ (bias).
Algorithm for learning the separating hyperplane: perceptron learning rule.
Classifier: if $\sum_{i=0}^{d} w_i x_i > 0$ then predict $\omega_1$ (or $+1$), otherwise predict $\omega_2$ (or $-1$).

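Perceptron II
Thresholded output:
$o(x_1, \dots, x_d) = +1$ if $w_0 + w_1 x_1 + \dots + w_d x_d > 0$, and $-1$ otherwise.
Simpler vector notation:
$o(\mathbf{x}) = \mathrm{sgn}(\mathbf{w} \cdot \mathbf{x})$, where $\mathrm{sgn}(v) = +1$ if $v > 0$ and $-1$ otherwise.
Space of the hypotheses: $\{\mathbf{w} \mid \mathbf{w} \in \mathbb{R}^{d+1}\}$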

Decision Surface of a Perceptron
A perceptron can represent some useful functions.
What weights represent $g(x_1, x_2) = \mathrm{AND}(x_1, x_2)$?
But some functions are not representable, e.g. those that are not linearly separable (XOR).
Therefore, we'll want networks of these...
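One way to explore the question is to test candidate weights directly. In the sketch below, inputs are assumed to be 0/1, the weights $(-0.8, 0.5, 0.5)$ are one possible answer for AND (not the only one), and the search for XOR weights runs over a coarse grid that is purely illustrative; it finds nothing, which is consistent with XOR not being linearly separable.

```python
from itertools import product

def output(w, x):
    """Thresholded unit: +1 if w0 + w1*x1 + w2*x2 > 0, otherwise -1."""
    return 1 if w[0] + w[1] * x[0] + w[2] * x[1] > 0 else -1

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = {x: (1 if x[0] and x[1] else -1) for x in inputs}
XOR = {x: (1 if x[0] != x[1] else -1) for x in inputs}

def represents(w, target):
    """True if the unit with weights w reproduces the target on all four inputs."""
    return all(output(w, x) == target[x] for x in inputs)

and_weights = (-0.8, 0.5, 0.5)            # one possible choice
grid = [i / 2 for i in range(-8, 9)]      # candidate weights in [-4, 4], step 0.5
xor_found = any(represents(w, XOR) for w in product(grid, repeat=3))

print(represents(and_weights, AND), xor_found)
```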

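Perceptron Training Rule I
Perceptron criterion function:
$J(\mathbf{a}) = \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{a})} (-\mathbf{a}^t \mathbf{y})$
where $\mathcal{Y}(\mathbf{a})$ is the set of examples misclassified by $\mathbf{a}$.
If no examples are misclassified, $\mathcal{Y}(\mathbf{a})$ is empty and $J(\mathbf{a}) = 0$ (i.e. $\mathbf{a}$ is a solution vector).
$J(\mathbf{a}) \geq 0$, since $\mathbf{a}^t \mathbf{y}_i \leq 0$ if $\mathbf{y}_i$ is misclassified.
Geometrically, $J(\mathbf{a})$ is proportional to the sum of the distances from the misclassified examples to the decision boundary.
Since $\nabla J = \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{a})} (-\mathbf{y})$, the update rule becomes
$\mathbf{a}(k + 1) \leftarrow \mathbf{a}(k) + \eta(k) \sum_{\mathbf{y} \in \mathcal{Y}_k} \mathbf{y}$
where $\mathcal{Y}_k$ is the set of examples misclassified by $\mathbf{a}(k)$.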

Perceptron Training I
Set all coefficients $a_i$ to zero
do
    for each instance $\mathbf{y}$ in the training data
        if $\mathbf{y}$ is classified incorrectly by the perceptron
            if $\mathbf{y}$ belongs to $\omega_1$ add it to $\mathbf{a}$
            else subtract it from $\mathbf{a}$
until all instances in the training data are classified correctly
return $\mathbf{a}$

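Perceptron Training II
BATCH PERCEPTRON TRAINING
Initialize $\mathbf{a}$, $\eta(\cdot)$, $\theta$, $k \leftarrow 0$
do $k \leftarrow k + 1$
    $\mathbf{a} \leftarrow \mathbf{a} + \eta(k) \sum_{\mathbf{y} \in \mathcal{Y}_k} \mathbf{y}$
until $\left\| \eta(k) \sum_{\mathbf{y} \in \mathcal{Y}_k} \mathbf{y} \right\| < \theta$
return $\mathbf{a}$
It can be proved that the procedure converges if the training data is linearly separable and $\eta$ is sufficiently small.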

Perceptron Training III
Why does this work?
Consider the situation where an instance $\mathbf{y}$ pertaining to the first class has been added to $\mathbf{a}$. The output for $\mathbf{y}$ becomes:
$(a_0 + y_0) y_0 + (a_1 + y_1) y_1 + (a_2 + y_2) y_2 + \dots + (a_d + y_d) y_d$
This means the output for $\mathbf{y}$ has increased by:
$y_0 y_0 + y_1 y_1 + y_2 y_2 + \dots + y_d y_d$
This quantity is always positive, thus the hyperplane has moved in the correct direction (and the output decreases for instances of the other class).

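Perceptron Training IV
[Figure: example run with $\eta = 1$ and $\mathbf{a}(1) = \mathbf{0}$. Sequence of misclassified instances: $\mathbf{y}_1 + \mathbf{y}_2 + \mathbf{y}_3$, $\mathbf{y}_2$, $\mathbf{y}_3$, $\mathbf{y}_1$, $\mathbf{y}_3$, stop]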

Perceptron Simplification
FIXED-INCREMENT SINGLE-EXAMPLE PERCEPTRON
input: $\{\mathbf{y}^{(k)}\}_{k=1}^{n}$ training examples
begin initialize $\mathbf{a}$, $k = 0$
    do $k \leftarrow (k + 1) \bmod n$
        if $\mathbf{y}^{(k)}$ is misclassified by the model based on $\mathbf{a}$
            then $\mathbf{a} \leftarrow \mathbf{a} + \mathbf{y}^{(k)}$
    until all examples are properly classified
    return $\mathbf{a}$
end
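The fixed-increment rule can be sketched as follows. The sketch assumes "normalized" examples (augmented vectors with the $\omega_2$ examples already negated, so a solution satisfies $\mathbf{a}^t \mathbf{y} > 0$ for all of them); the function name, the `max_passes` safeguard for nonseparable data, and the one-dimensional toy data are hypothetical additions.

```python
def perceptron(samples, max_passes=1000):
    """Fixed-increment single-example perceptron on normalized, augmented samples."""
    a = [0.0] * len(samples[0])
    for _ in range(max_passes):          # assumed cap, in case the data is not separable
        errors = 0
        for y in samples:
            # y is misclassified when a . y is not strictly positive
            if sum(ai * yi for ai, yi in zip(a, y)) <= 0:
                a = [ai + yi for ai, yi in zip(a, y)]
                errors += 1
        if errors == 0:
            return a                     # every example is properly classified
    return None

# Hypothetical data in augmented form [1, x]:
# omega_1 at x = 2 and x = 3; omega_2 at x = -1 and x = 0 (negated)
samples = [[1.0, 2.0], [1.0, 3.0], [-1.0, 1.0], [-1.0, 0.0]]
a = perceptron(samples)
print(a)
```

On this data the run makes three corrections and stops in the third pass with a weight vector corresponding to the boundary $x = 0.5$.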

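Generalizations I
VARIABLE-INCREMENT PERCEPTRON WITH MARGIN
begin initialize $\mathbf{a}$, $\theta$, margin $b$, $\eta(\cdot)$, $k \leftarrow 0$
    do $k \leftarrow (k + 1) \bmod n$
        if $\mathbf{a}^t \mathbf{y}^{(k)} \leq b$
            then $\mathbf{a} \leftarrow \mathbf{a} + \eta(k) \mathbf{y}^{(k)}$
    until $\mathbf{a}^t \mathbf{y}^{(k)} > b$ for all $k$
    return $\mathbf{a}$
end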

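Generalizations II
BATCH VARIABLE-INCREMENT PERCEPTRON
begin initialize $\mathbf{a}$, $\eta(\cdot)$, $k \leftarrow 0$
    do $k \leftarrow (k + 1) \bmod n$
        $\mathcal{Y}_k \leftarrow \{\}$, $j \leftarrow 0$
        do $j \leftarrow j + 1$
            if $\mathbf{y}_j$ is misclassified then $\mathcal{Y}_k \leftarrow \mathcal{Y}_k \cup \{\mathbf{y}_j\}$
        until $j = n$
        $\mathbf{a} \leftarrow \mathbf{a} + \eta(k) \sum_{\mathbf{y} \in \mathcal{Y}_k} \mathbf{y}$
    until $\mathcal{Y}_k = \{\}$
    return $\mathbf{a}$
end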

Comments
The perceptron adjusts the parameters only when it encounters an error, i.e. a misclassified training example.
Correctly classified examples can be ignored.
The learning rate $\eta$ can be chosen arbitrarily: it only affects the norm of the final $\mathbf{a}$ (and the corresponding magnitude of $a_0$).
The final weight vector $\mathbf{a}$ is a linear combination of training points.

Nonseparable Case
The perceptron is an error-correcting procedure: it converges when the examples are linearly separable.
Even if a separating vector is found for the training examples, it does not follow that the resulting classifier will perform well on independent test data.
To ensure that the performance on training and test data will be similar, many training examples should be used.
Sufficiently large training sets are almost certainly not linearly separable.
No weight vector can correctly classify every example in a nonseparable set.
The corrections may never cease if the set is nonseparable.

Learning rate
If we choose $\eta(k) \to 0$ as $k \to \infty$, then performance can be acceptable on non-separable problems while preserving the ability to find a solution on separable problems.
$\eta(k)$ can be considered as a function of recent performance, decreasing it as performance improves, e.g. $\eta(k) = \eta / k$.
The rate at which $\eta(k)$ approaches zero is important.
Too slow: the result will be sensitive to those examples that render the set non-separable.
Too fast: the procedure may converge prematurely with sub-optimal results.

Linear Models: WINNOW
Another mistake-driven algorithm for finding a separating hyperplane.
It assumes binary attributes (i.e. propositional variables).
Main difference: multiplicative instead of additive updates. Weights are multiplied by a parameter $\alpha > 1$ (or its inverse).
Another difference: a user-specified threshold parameter $\theta$.
Predict the first class if $w_0 + w_1 x_1 + w_2 x_2 + \dots + w_k x_k > \theta$.

The Algorithm I
WINNOW
initialize $\mathbf{a}$, $\alpha$
while some instances are misclassified
    for each instance $\mathbf{y}$ in the training data
        classify $\mathbf{y}$ using the current model $\mathbf{a}$
        if the predicted class is incorrect
            if $\mathbf{y}$ belongs to the target class
                for each attribute $y_i = 1$, multiply $a_i$ by $\alpha$ (if $y_i = 0$, $a_i$ is left unchanged)
            otherwise
                for each attribute $y_i = 1$, divide $a_i$ by $\alpha$ (if $y_i = 0$, $a_i$ is left unchanged)
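A minimal sketch of the multiplicative update follows. Several details are assumptions made for the illustration and are not stated in the slides: all weights start at 1, no bias weight is used, $\alpha = 2$, $\theta = 1.5$, a `max_passes` cap is added, and the toy target concept is simply "the first attribute is 1" over three binary attributes.

```python
from itertools import product

def winnow(data, alpha=2.0, theta=1.5, max_passes=100):
    """WINNOW on binary attribute vectors; data is a list of (y, in_target) pairs."""
    a = [1.0] * len(data[0][0])          # assumed initialization: all weights 1
    for _ in range(max_passes):          # assumed cap on the number of passes
        mistakes = 0
        for y, in_target in data:
            predicted = sum(ai * yi for ai, yi in zip(a, y)) > theta
            if predicted != in_target:
                mistakes += 1
                # promote active attributes on a missed positive, demote otherwise
                factor = alpha if in_target else 1.0 / alpha
                a = [ai * factor if yi == 1 else ai for ai, yi in zip(a, y)]
        if mistakes == 0:
            break
    return a

# Hypothetical target concept: the first of three binary attributes is 1
data = [(y, y[0] == 1) for y in product([0, 1], repeat=3)]
weights = winnow(data)
print(weights)
```

On this data two mistakes are enough: the irrelevant attributes are demoted once and the relevant one is promoted once.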

The Algorithm II
WINNOW is very effective in homing in on relevant features (it is attribute efficient).
It can also be used in an on-line setting in which new instances arrive continuously (like the perceptron algorithm).

Balanced WINNOW I
WINNOW doesn't allow negative weights, and this can be a drawback in some applications.
BALANCED WINNOW maintains two weight vectors, one for each class: $\mathbf{a}^+$ and $\mathbf{a}^-$.
An instance is classified as belonging to the first class (of two classes) if:
$(a_0^+ - a_0^-) + (a_1^+ - a_1^-) y_1 + (a_2^+ - a_2^-) y_2 + \dots + (a_k^+ - a_k^-) y_k > \theta$

Balanced WINNOW II
BALANCED WINNOW
while some instances are misclassified
    for each instance $\mathbf{y}$ in the training data
        classify $\mathbf{y}$ using the current weights
        if the predicted class is incorrect
            if $\mathbf{y}$ belongs to the first class
                for each attribute $y_i = 1$, multiply $a_i^+$ by $\alpha$ and divide $a_i^-$ by $\alpha$ (if $y_i = 0$, leave $a_i^+$ and $a_i^-$ unchanged)
            otherwise
                for each attribute $y_i = 1$, multiply $a_i^-$ by $\alpha$ and divide $a_i^+$ by $\alpha$ (if $y_i = 0$, leave $a_i^+$ and $a_i^-$ unchanged)

Minimum Squared Error Approach I
Minimum Squared Error (MSE): it trades the ability to obtain a separating vector for good performance on both separable and non-separable problems.
Previously, we sought a weight vector $\mathbf{a}$ making all of the inner products $\mathbf{a}^t \mathbf{y}_i > 0$.
In the MSE procedure, one tries to make $\mathbf{a}^t \mathbf{y}_i = b_i$, where the $b_i$ are some arbitrarily specified positive constants.
Using matrix notation: $\mathbf{Y} \mathbf{a} = \mathbf{b}$.
If $\mathbf{Y}$ is nonsingular, then $\mathbf{a} = \mathbf{Y}^{-1} \mathbf{b}$.
Unfortunately $\mathbf{Y}$ is not a square matrix: it usually has more rows than columns.
When there are more equations than unknowns, $\mathbf{a}$ is overdetermined, and ordinarily no exact solution exists.

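Minimum Squared Error Approach II
We can seek a weight vector $\mathbf{a}$ that minimizes some function of an error vector $\mathbf{e} = \mathbf{Y} \mathbf{a} - \mathbf{b}$.
Minimizing the squared length of the error vector is equivalent to minimizing the sum-of-squared-error criterion function
$J(\mathbf{a}) = \|\mathbf{Y} \mathbf{a} - \mathbf{b}\|^2 = \sum_{i=1}^{n} (\mathbf{a}^t \mathbf{y}_i - b_i)^2$
whose gradient is
$\nabla J = \sum_{i=1}^{n} 2 (\mathbf{a}^t \mathbf{y}_i - b_i) \mathbf{y}_i = 2 \mathbf{Y}^t (\mathbf{Y} \mathbf{a} - \mathbf{b})$
Setting the gradient equal to zero, the following necessary condition holds:
$\mathbf{Y}^t \mathbf{Y} \mathbf{a} = \mathbf{Y}^t \mathbf{b}$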

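Minimum Squared Error Approach III
$\mathbf{Y}^t \mathbf{Y}$ is a square matrix which is often nonsingular. Therefore, solving for $\mathbf{a}$:
$\mathbf{a} = (\mathbf{Y}^t \mathbf{Y})^{-1} \mathbf{Y}^t \mathbf{b} = \mathbf{Y}^+ \mathbf{b}$
where $\mathbf{Y}^+ = (\mathbf{Y}^t \mathbf{Y})^{-1} \mathbf{Y}^t$ is the pseudo-inverse of $\mathbf{Y}$.
$\mathbf{Y}^+$ can also be written as
$\mathbf{Y}^+ = \lim_{\epsilon \to 0} (\mathbf{Y}^t \mathbf{Y} + \epsilon \mathbf{I})^{-1} \mathbf{Y}^t$
and it can be shown that this limit always exists; hence $\mathbf{a} = \mathbf{Y}^+ \mathbf{b}$ is the MSE solution to the problem $\mathbf{Y} \mathbf{a} = \mathbf{b}$.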
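As an illustration, the normal equations $\mathbf{Y}^t \mathbf{Y} \mathbf{a} = \mathbf{Y}^t \mathbf{b}$ can be solved directly for a small problem. The sketch below uses plain Gauss-Jordan elimination rather than an explicit pseudo-inverse; the function name `mse_solution`, the toy data (augmented, with $\omega_2$ examples negated) and the choice $b_i = 1$ are assumptions.

```python
def mse_solution(Y, b):
    """Solve the normal equations Y^t Y a = Y^t b by Gauss-Jordan elimination."""
    d = len(Y[0])
    # Augmented system [Y^t Y | Y^t b]
    A = [[sum(row[i] * row[j] for row in Y) for j in range(d)]
         + [sum(row[i] * bi for row, bi in zip(Y, b))]
         for i in range(d)]
    for col in range(d):
        # Partial pivoting: bring the largest remaining entry of this column up
        pivot = max(range(col, d), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        if abs(p) < 1e-12:
            raise ValueError("Y^t Y is singular")
        A[col] = [v / p for v in A[col]]
        for r in range(d):
            if r != col:
                f = A[r][col]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][d] for i in range(d)]

# Hypothetical normalized data: omega_1 at x = 2, 3; omega_2 at x = -1, 0 (negated)
Y = [[1.0, 2.0], [1.0, 3.0], [-1.0, 1.0], [-1.0, 0.0]]
b = [1.0, 1.0, 1.0, 1.0]
a = mse_solution(Y, b)
print(a)   # approximately [-0.6, 0.6]
```

For this data the MSE solution also happens to be a separating vector, since every $\mathbf{a}^t \mathbf{y}_i$ is positive; in general this is not guaranteed.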

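WIDROW-HOFF procedure a.k.a. LMS I
The criterion function $J(\mathbf{a}) = \|\mathbf{Y} \mathbf{a} - \mathbf{b}\|^2$ could be minimized by a gradient descent procedure.
Advantages: it avoids the problems that arise when $\mathbf{Y}^t \mathbf{Y}$ is singular, and it avoids the need for working with large matrices.
Since $\nabla J = 2 \mathbf{Y}^t (\mathbf{Y} \mathbf{a} - \mathbf{b})$, a simple update rule would be
$\mathbf{a}(1)$ arbitrary, $\mathbf{a}(k + 1) = \mathbf{a}(k) + \eta(k) \mathbf{Y}^t (\mathbf{b} - \mathbf{Y} \mathbf{a}(k))$
or, if we consider the examples sequentially,
$\mathbf{a}(1)$ arbitrary, $\mathbf{a}(k + 1) = \mathbf{a}(k) + \eta(k) \left[ b_k - \mathbf{a}(k)^t \mathbf{y}(k) \right] \mathbf{y}(k)$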

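WIDROW-HOFF procedure a.k.a. LMS II
LMS($\{\mathbf{y}_i\}_{i=1}^{n}$)
input $\{\mathbf{y}_i\}_{i=1}^{n}$: training examples
begin initialize $\mathbf{a}$, $\mathbf{b}$, $\theta$, $\eta(\cdot)$, $k \leftarrow 0$
    do $k \leftarrow (k + 1) \bmod n$
        $\mathbf{a} \leftarrow \mathbf{a} + \eta(k) (b_k - \mathbf{a}^t \mathbf{y}(k)) \mathbf{y}(k)$
    until $\| \eta(k) (b_k - \mathbf{a}^t \mathbf{y}(k)) \mathbf{y}(k) \| < \theta$
    return $\mathbf{a}$
end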
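The sequential rule can be sketched as below. This is a variation under stated assumptions rather than a literal transcription of the procedure: a constant learning rate replaces $\eta(k)$, the stopping test compares the largest update seen during a full pass with $\theta$ (instead of testing a single update), a `max_passes` cap is added, and the data and targets $b_i = 1$ are hypothetical.

```python
import math

def lms(Y, b, eta=0.001, theta=1e-9, max_passes=20000):
    """Widrow-Hoff (LMS) updates, one example at a time, with a constant step size."""
    a = [0.0] * len(Y[0])
    for _ in range(max_passes):          # assumed cap on the number of passes
        largest = 0.0
        for y, bk in zip(Y, b):
            err = bk - sum(ai * yi for ai, yi in zip(a, y))
            step = [eta * err * yi for yi in y]
            a = [ai + si for ai, si in zip(a, step)]
            largest = max(largest, math.sqrt(sum(s * s for s in step)))
        if largest < theta:              # assumed per-pass variant of the stopping test
            break
    return a

# Hypothetical normalized data: omega_1 at x = 2, 3; omega_2 at x = -1, 0 (negated)
Y = [[1.0, 2.0], [1.0, 3.0], [-1.0, 1.0], [-1.0, 0.0]]
b = [1.0, 1.0, 1.0, 1.0]
a_lms = lms(Y, b)
print(a_lms)   # close to the MSE solution, roughly [-0.6, 0.6]
```

Because $\mathbf{Y} \mathbf{a} = \mathbf{b}$ has no exact solution here, the individual corrections never vanish with a constant step; the weights hover near the MSE solution and the pass cap ends the run. A decreasing $\eta(k)$ would let the corrections die out.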

Summary
The perceptron training rule is guaranteed to succeed if the training examples are linearly separable and the learning rate $\eta$ is sufficiently small.
The linear unit training rule uses gradient descent. It is guaranteed to converge to the hypothesis with minimum squared error, given a sufficiently small learning rate $\eta$, even when the training data contains noise and even when the training data is not separable by $H$.

Linear Regression
Standard technique for numeric prediction.
The outcome is a linear combination of the attributes:
$x = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_d x_d$
The weights $\mathbf{w}$ are calculated from the training data using standard math algorithms.
Predicted value for the first training instance $\mathbf{x}^{(1)}$:
$w_0 + w_1 x_1^{(1)} + w_2 x_2^{(1)} + \dots + w_d x_d^{(1)} = \sum_{j=0}^{d} w_j x_j^{(1)}$
assuming extended vectors with $x_0 = 1$.

Probabilistic Classification
Multiresponse Linear Regression (MLR)
Any regression technique can be used for classification.
Training: perform a regression for each class (each $g_i$ is linear), setting the output to 1 for training instances that belong to the class and 0 for those that don't.
Prediction: compute the linear expression of each class and predict the class corresponding to the model with the largest output value (membership value).

Logistic Regression I
MLR drawbacks:
1. Membership values are not proper probabilities: they can fall outside $[0, 1]$.
2. Least squares regression assumes that the errors are not only statistically independent, but are also normally distributed with the same standard deviation.
The logit transformation does not suffer from these problems.
It builds a linear model for a transformed target variable.
Assume we have two classes.

Logistic Regression II
Logistic regression replaces the target $\Pr(1 \mid \mathbf{x})$, which cannot be approximated well using a linear function, with the target
$\log \left( \frac{\Pr(1 \mid \mathbf{x})}{1 - \Pr(1 \mid \mathbf{x})} \right)$
The transformation maps $[0, 1]$ to $(-\infty, +\infty)$.

Logistic Regression III
[Figure: the logit transformation function]

Example: Logistic Regression Model
Resulting model:
$\Pr(1 \mid \mathbf{y}) = \frac{1}{1 + e^{-(a_0 + a_1 y_1 + a_2 y_2 + \dots + a_d y_d)}}$
Example: model with $a_0 = 0.5$ and $a_1 = 1$. [Figure: plot of the resulting model]
Parameters are induced from data using maximum likelihood.
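The example model with $a_0 = 0.5$ and $a_1 = 1$ can be evaluated at a few points to see its shape. The function name `prob_class1` and the evaluation points are illustrative assumptions.

```python
import math

def prob_class1(a, y):
    """Pr(1 | y) = 1 / (1 + exp(-(a0 + a1*y1 + ... + ad*yd)))."""
    z = a[0] + sum(ai * yi for ai, yi in zip(a[1:], y))
    return 1.0 / (1.0 + math.exp(-z))

a = [0.5, 1.0]   # the example model: a0 = 0.5, a1 = 1

for y1 in (-4.0, -0.5, 0.0, 4.0):
    print(y1, round(prob_class1(a, [y1]), 4))
```

The probability equals 0.5 where the linear term vanishes (here at $y_1 = -0.5$) and approaches 0 and 1 on either side, so the output always stays inside $[0, 1]$.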

Maximum Likelihood
Aim: maximize the probability of the training data with respect to the parameters.
We can use logarithms of probabilities and maximize the log-likelihood of the model, rather than minimizing the MSE:
$\sum_{i=1}^{n} \left[ (1 - x^{(i)}) \log \left( 1 - \Pr(1 \mid \mathbf{y}^{(i)}) \right) + x^{(i)} \log \Pr(1 \mid \mathbf{y}^{(i)}) \right]$
where the $x^{(i)}$'s are the responses (either 0 or 1).
The weights $a_i$ need to be chosen to maximize the log-likelihood.
A relatively simple method: iteratively re-weighted least squares.
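The log-likelihood can be computed directly to compare candidate weight vectors. The sketch below only evaluates the criterion; it does not implement iteratively re-weighted least squares. The function names, the four toy instances and the two candidate weight vectors are hypothetical.

```python
import math

def prob_class1(a, y):
    """Logistic model Pr(1 | y) with augmented weights a = [a0, a1, ..., ad]."""
    z = a[0] + sum(ai * yi for ai, yi in zip(a[1:], y))
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(a, data):
    """Sum over instances of (1 - x) * log(1 - p) + x * log(p), with x in {0, 1}."""
    total = 0.0
    for y, x in data:
        p = prob_class1(a, y)
        total += (1 - x) * math.log(1.0 - p) + x * math.log(p)
    return total

# Hypothetical one-attribute data: (attribute vector, 0/1 response)
data = [([-2.0], 0), ([-1.0], 0), ([1.0], 1), ([2.0], 1)]

ll_flat = log_likelihood([0.0, 0.0], data)    # model that always outputs 0.5
ll_slope = log_likelihood([0.0, 1.0], data)   # model that follows the attribute
print(ll_flat, ll_slope)
```

The second weight vector assigns higher probability to the observed responses, so its log-likelihood is larger (closer to zero); a maximum likelihood procedure searches for the weights that push this value as high as possible.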

Credits
R. Duda, P. Hart, D. Stork: Pattern Classification, Wiley
T. M. Mitchell: Machine Learning, McGraw Hill
I. Witten & E. Frank: Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann