Last updated: Oct 22, 2012 LINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Problems 2 Please do Problem 8.3 in the textbook. We will discuss this in class.

Classification: Problem Statement 3 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification, the input variable x may still be continuous, but the target variable is discrete. In the simplest case, t can take only 2 values, e.g.,
$t = +1 \leftrightarrow \mathbf{x}$ assigned to $C_1$
$t = -1 \leftrightarrow \mathbf{x}$ assigned to $C_2$

Example Problem 4 Animal or Vegetable?

Discriminative Classifiers 5 If the conditional distributions are normal, the best thing to do is to estimate the parameters of these distributions and use Bayesian decision theory to classify input vectors. Decision boundaries are generally quadratic. However, if the conditional distributions are not exactly normal, this generative approach will yield sub-optimal results. An alternative is to build a discriminative classifier that focuses on modeling the decision boundary directly, rather than the conditional distributions themselves.

Linear Models for Classification 6 Linear models for classification separate input vectors into classes using linear (hyperplane) decision boundaries. Example: a 2D input vector $\mathbf{x}$ and two discrete classes $C_1$ and $C_2$.
[Figure: example plotted in the $(x_1, x_2)$ plane.]

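Two Class Discriminant Function 7
$$y(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0$$
$y(\mathbf{x}) \ge 0 \rightarrow \mathbf{x}$ assigned to $C_1$
$y(\mathbf{x}) < 0 \rightarrow \mathbf{x}$ assigned to $C_2$
Thus $y(\mathbf{x}) = 0$ defines the decision boundary.
[Figure: the boundary $y = 0$ in the $(x_1, x_2)$ plane separates region $R_1$ ($y > 0$) from region $R_2$ ($y < 0$). The vector $\mathbf{w}$ is normal to the boundary, the boundary lies at signed distance $-w_0 / \|\mathbf{w}\|$ from the origin, and a point $\mathbf{x}$ lies at signed distance $y(\mathbf{x}) / \|\mathbf{w}\|$ from the boundary.]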

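Two-Class Discriminant Function 8
$$y(\mathbf{x}) = \mathbf{w}^t \mathbf{x} + w_0$$
$y(\mathbf{x}) \ge 0 \rightarrow \mathbf{x}$ assigned to $C_1$
$y(\mathbf{x}) < 0 \rightarrow \mathbf{x}$ assigned to $C_2$
For convenience, let
$$\mathbf{w} = (w_1, \ldots, w_M)^t \rightarrow (w_0, w_1, \ldots, w_M)^t$$
and
$$\mathbf{x} = (x_1, \ldots, x_M)^t \rightarrow (1, x_1, \ldots, x_M)^t$$
So we can express $y(\mathbf{x}) = \mathbf{w}^t \mathbf{x}$.
[Figure: same decision-boundary diagram as on the previous slide.]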

Generalized Linear Models 9 For classification problems, we want y to be a predictor of t. In other words, we wish to map the input vector into one of a number of discrete classes, or to posterior probabilities that lie between 0 and 1. For this purpose, it is useful to elaborate the linear model by introducing a nonlinear activation function f, which typically will constrain y to lie between -1 and 1 or between 0 and 1.
$$y(\mathbf{x}) = f(\mathbf{w}^t \mathbf{x} + w_0)$$

The Perceptron 10
$$y(\mathbf{x}) = f(\mathbf{w}^t \mathbf{x} + w_0)$$
$y(\mathbf{x}) \ge 0 \rightarrow \mathbf{x}$ assigned to $C_1$
$y(\mathbf{x}) < 0 \rightarrow \mathbf{x}$ assigned to $C_2$
A classifier based upon this simple generalized linear model is called a (single layer) perceptron. It can also be identified with an abstracted model of a neuron called the McCulloch-Pitts model.
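The decision rule above can be sketched in a few lines of NumPy. This is only an illustration under assumptions: the slides do not specify f, so a step function returning +1 or -1 is assumed here, and the helper names (`augment`, `step`, `perceptron_predict`) and the weights are hypothetical.

```python
import numpy as np

def augment(X):
    """Prepend a constant 1 to each input so the bias w_0 is absorbed into w."""
    X = np.atleast_2d(X)
    return np.hstack([np.ones((X.shape[0], 1)), X])

def step(a):
    """Assumed activation f: +1 where a >= 0, otherwise -1."""
    return np.where(a >= 0, 1, -1)

def perceptron_predict(w_aug, X):
    """Return +1 (class C1) or -1 (class C2) for each row of X."""
    return step(augment(X) @ w_aug)

# Hypothetical weights: w_0 = -1 and w = (1, 1), i.e. the boundary x_1 + x_2 = 1.
w_aug = np.array([-1.0, 1.0, 1.0])
X_demo = np.array([[2.0, 2.0], [0.0, 0.0], [0.5, 0.5]])
labels = perceptron_predict(w_aug, X_demo)
```

The third demo point lies exactly on the boundary, so under the rule y(x) >= 0 it is assigned to C1.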

End of Lecture Oct 15, 2012

Parameter Learning 12 How do we learn the parameters of a perceptron?

Outline 13
- The Perceptron Algorithm
- Least-Squares Classifiers
- Fisher's Linear Discriminant
- Logistic Classifiers

Case 1. Linearly Separable Inputs 14 For starters, let's assume that the training data are in fact perfectly linearly separable. In other words, there exists at least one hyperplane (one set of weights) that yields 0 classification error. We seek an algorithm that can automatically find such a hyperplane.

The Perceptron Algorithm 15 The perceptron algorithm was invented by Frank Rosenblatt (1962). The algorithm is iterative. The strategy is to start with a random guess at the weights w, and to then iteratively change the weights to move the hyperplane in a direction that lowers the classification error.
[Photo: Frank Rosenblatt (1928-1971)]

The Perceptron Algorithm 16 Note that as we change the weights continuously, the classification error changes in discontinuous, piecewise constant fashion. Thus we cannot use the classification error per se as our objective function to minimize. What would be a better objective function?

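The Perceptron Criterion 17
Note that we seek $\mathbf{w}$ such that
$\mathbf{w}^t \mathbf{x}_n \ge 0$ when $t_n = +1$
$\mathbf{w}^t \mathbf{x}_n < 0$ when $t_n = -1$
In other words, we would like $\mathbf{w}^t \mathbf{x}_n t_n \ge 0 \;\; \forall n$.
Thus we seek to minimize
$$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^t \mathbf{x}_n t_n$$
where $\mathcal{M}$ is the set of misclassified inputs.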

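The Perceptron Criterion 18
$$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^t \mathbf{x}_n t_n$$
where $\mathcal{M}$ is the set of misclassified inputs.
Observations:
- $E_P(\mathbf{w})$ is always non-negative.
- $E_P(\mathbf{w})$ is continuous and piecewise linear, and thus easier to minimize.
[Figure: $E_P(\mathbf{w})$ plotted against $w_i$.]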

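The Perceptron Algorithm 19
$$E_P(\mathbf{w}) = -\sum_{n \in \mathcal{M}} \mathbf{w}^t \mathbf{x}_n t_n$$
where $\mathcal{M}$ is the set of misclassified inputs.
$$\frac{dE_P(\mathbf{w})}{d\mathbf{w}} = -\sum_{n \in \mathcal{M}} \mathbf{x}_n t_n$$
where the derivative exists.
Gradient descent:
$$\mathbf{w}^{\tau+1} = \mathbf{w}^{\tau} - \eta \nabla E_P(\mathbf{w}) = \mathbf{w}^{\tau} + \eta \sum_{n \in \mathcal{M}} \mathbf{x}_n t_n$$
[Figure: $E_P(\mathbf{w})$ plotted against $w_i$.]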
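As a small numerical illustration of the criterion and one batch gradient-descent step, the sketch below evaluates E_P and its gradient on three hand-picked inputs. The data, the starting weights, the learning rate, and the function name `perceptron_criterion` are all hypothetical, and inputs with zero margin are assumed to count as correctly classified.

```python
import numpy as np

def perceptron_criterion(w, X, t):
    """Return E_P(w) and its gradient; rows of X are augmented inputs, t holds +/-1."""
    margins = (X @ w) * t                  # w^t x_n t_n for every n
    mis = margins < 0                      # assumption: zero margin counts as correct
    E = -np.sum(margins[mis])
    grad = -(X[mis] * t[mis, None]).sum(axis=0)
    return E, grad

X = np.array([[1.0, 2.0, 1.0],
              [1.0, -1.0, -1.0],
              [1.0, 1.0, -2.0]])
t = np.array([1, -1, -1])
w = np.array([0.0, -1.0, 0.0])             # deliberately poor starting weights
eta = 0.5                                  # assumed learning rate

E_before, grad = perceptron_criterion(w, X, t)
w_next = w - eta * grad                    # one gradient-descent step
E_after, _ = perceptron_criterion(w_next, X, t)
```

Here the first two inputs start out misclassified, and for this particular choice of data and learning rate a single step is enough to classify all three correctly. That is a property of the toy data, not of the method in general.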

The Perceptron Algorithm 20
$$\mathbf{w}^{\tau+1} = \mathbf{w}^{\tau} - \eta \nabla E_P(\mathbf{w}) = \mathbf{w}^{\tau} + \eta \sum_{n \in \mathcal{M}} \mathbf{x}_n t_n$$
Why does this make sense?
- If an input from $C_1$ ($t = +1$) is misclassified, we need to make its projection on $\mathbf{w}$ more positive.
- If an input from $C_2$ ($t = -1$) is misclassified, we need to make its projection on $\mathbf{w}$ more negative.

The Perceptron Algorithm 21
The algorithm can be implemented sequentially:
- Repeat until convergence:
  - For each input $(\mathbf{x}_n, t_n)$:
    - If it is correctly classified, do nothing.
    - If it is misclassified, update the weight vector to be $\mathbf{w}^{\tau+1} = \mathbf{w}^{\tau} + \eta \mathbf{x}_n t_n$
- Note that this will lower the contribution of input n to the objective function:
$$-(\mathbf{w}^{(\tau+1)})^t \mathbf{x}_n t_n = -(\mathbf{w}^{(\tau)})^t \mathbf{x}_n t_n - \eta (\mathbf{x}_n t_n)^t \mathbf{x}_n t_n < -(\mathbf{w}^{(\tau)})^t \mathbf{x}_n t_n$$
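The sequential procedure can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `train_perceptron`, the epoch cap, the random seed, and the toy data are assumptions, and an input lying exactly on the boundary is treated as misclassified so that the final solution has strictly positive margins.

```python
import numpy as np

def train_perceptron(X, t, eta=1.0, max_epochs=100, seed=0):
    """Sequential perceptron learning; rows of X are augmented inputs, t holds +/-1."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])        # random initial guess at the weights
    for _ in range(max_epochs):
        updated = False
        for x_n, t_n in zip(X, t):
            if (w @ x_n) * t_n <= 0:       # misclassified (or exactly on the boundary)
                w = w + eta * x_n * t_n    # move the hyperplane toward this input
                updated = True
        if not updated:                    # a full pass with no mistakes: converged
            return w, True
    return w, False

# Hypothetical separable toy set; the first column is the constant 1.
X = np.array([[1, 2, 1], [1, 1, 3], [1, 3, 2],
              [1, -1, -2], [1, -2, -1], [1, -3, -1]], dtype=float)
t = np.array([1, 1, 1, -1, -1, -1])
w, converged = train_perceptron(X, t)
```

The epoch cap is there because, on data that are not linearly separable, the loop would otherwise never terminate.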

Not Monotonic 22 While updating with respect to a misclassified input n will lower the error for that input, the error for other misclassified inputs may increase. Also, new inputs that had been classified correctly may now be misclassified. The result is that the perceptron algorithm is not guaranteed to reduce the total error monotonically at each stage.

The Perceptron Convergence Theorem 23 Despite this non-monotonicity, if in fact the data are linearly separable, then the algorithm is guaranteed to find an exact solution in a finite number of steps (Rosenblatt, 1962).

Example 24
[Figure: example plots; both axes range from -1 to 1.]

The First Learning Machine 25 Mark 1 Perceptron Hardware (c. 1960)
[Photos: visual inputs; patch board allowing configuration of inputs $\phi$; rack of adaptive weights $\mathbf{w}$ (motor-driven potentiometers).]

Practical Limitations 26 The Perceptron Convergence Theorem is an important result. However, there are practical limitations:
- Convergence may be slow.
- If the data are not separable, the algorithm will not converge.
- We will only know that the data are separable once the algorithm converges.
- The solution is in general not unique, and will depend upon initialization, scheduling of input vectors, and the learning rate η.

Generalization to inputs that are not linearly separable 27
The single-layer perceptron can be generalized to yield good linear solutions to problems that are not linearly separable.
Example: The Pocket Algorithm (Gallant, 1990)
Idea:
- Run the perceptron algorithm.
- Keep track of the weight vector $\mathbf{w}^*$ that has produced the lowest classification error achieved so far.
- It can be shown that $\mathbf{w}^*$ will converge to an optimal solution with probability 1.
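The pocket idea can be sketched as below. This is a simplified variant written for illustration: it re-evaluates the training error after every update and keeps the best weights seen, which may differ in detail from the published algorithm. The function names, the update budget, and the four-point toy set (which no line can separate) are assumptions.

```python
import numpy as np

def count_errors(w, X, t):
    """Number of inputs with non-positive margin w^t x_n t_n."""
    return int(np.sum((X @ w) * t <= 0))

def pocket_perceptron(X, t, eta=1.0, n_updates=200, seed=0):
    """Perceptron updates plus a 'pocket' holding the best weights seen so far."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    best_w, best_err = w.copy(), count_errors(w, X, t)
    for _ in range(n_updates):
        mis = np.flatnonzero((X @ w) * t <= 0)
        if mis.size == 0:                  # everything classified correctly: stop
            break
        n = rng.choice(mis)                # pick one misclassified input at random
        w = w + eta * X[n] * t[n]          # ordinary perceptron update
        err = count_errors(w, X, t)
        if err < best_err:                 # better than the pocket: replace it
            best_w, best_err = w.copy(), err
    return best_w, best_err

# Hypothetical non-separable toy set; the first column is the constant 1.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
t = np.array([-1, 1, 1, -1])
best_w, best_err = pocket_perceptron(X, t)
```

Unlike the plain perceptron, the returned weights never get worse as more updates are run, because the pocket is only replaced when the training error strictly decreases.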

Generalization to Multiclass Problems 28 How can we use perceptrons, or linear classifiers in general, to classify inputs when there are K > 2 classes?

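K>2 Classes 29 Idea #1: Just use K-1 discriminant functions, each of which separates one class $C_k$ from the rest. (One-versus-the-rest classifier.) Problem: Ambiguous regions.
[Figure: regions $R_1$, $R_2$, $R_3$ formed by two discriminants ($C_1$ vs. not $C_1$, and $C_2$ vs. not $C_2$), with an ambiguous region marked "?".]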

K>2 Classes 30 Idea #2: Use K(K-1)/2 discriminant functions, each of which separates two classes $C_j$, $C_k$ from each other. (One-versus-one classifier.) Each point is classified by majority vote. Problem: Ambiguous regions.
[Figure: regions $R_1$, $R_2$, $R_3$ formed by pairwise discriminants among $C_1$, $C_2$, $C_3$, with an ambiguous region marked "?".]

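K>2 Classes 31 Idea #3: Use K discriminant functions $y_k(\mathbf{x})$. Use the magnitude of $y_k(\mathbf{x})$, not just the sign.
$$y_k(\mathbf{x}) = \mathbf{w}_k^t \mathbf{x} + w_{k0}$$
$\mathbf{x}$ assigned to $C_k$ if $y_k(\mathbf{x}) > y_j(\mathbf{x}) \;\; \forall j \ne k$
Decision boundary $y_k(\mathbf{x}) = y_j(\mathbf{x})$:
$$(\mathbf{w}_k - \mathbf{w}_j)^t \mathbf{x} + (w_{k0} - w_{j0}) = 0$$
Results in decision regions that are simply-connected and convex.
[Figure: decision regions $R_i$, $R_j$, $R_k$, with two points $\mathbf{x}_A$ and $\mathbf{x}_B$ in the same region and a point $\hat{\mathbf{x}}$ on the line segment between them.]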
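A sketch of this assignment rule, assuming each discriminant includes a bias term $w_{k0}$ as in the decision-boundary equation. The weight values and the function name `classify_multiclass` are hypothetical, and classes are numbered from 0 for convenience.

```python
import numpy as np

def classify_multiclass(W, w0, X):
    """Assign each row of X to the class k with the largest y_k(x) = w_k^t x + w_k0."""
    scores = X @ W + w0                    # shape (N, K); column k holds y_k(x)
    return np.argmax(scores, axis=1)

# Hypothetical weights for K = 3 classes in 2D; column k of W is w_k.
W = np.array([[1.0, -1.0, 0.0],
              [0.0, 0.0, 1.0]])
w0 = np.array([0.0, 0.0, -0.5])
X = np.array([[2.0, 0.0], [-2.0, 0.0], [0.0, 3.0]])
classes = classify_multiclass(W, w0, X)
```

Because every input receives exactly one largest score (ties aside), this rule leaves no ambiguous regions, unlike the one-versus-the-rest and one-versus-one schemes.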

Example: Kesler's Construction 32 The perceptron algorithm can be generalized to K-class classification problems.
Example: Kesler's Construction:
- Allows use of the perceptron algorithm to simultaneously learn K separate weight vectors $\mathbf{w}_i$.
- Inputs are then classified in Class i if and only if $\mathbf{w}_i^t \mathbf{x} > \mathbf{w}_j^t \mathbf{x} \;\; \forall j \ne i$
- The algorithm will converge to an optimal solution if a solution exists, i.e., if all training vectors can be correctly classified according to this rule.

1-of-K Coding Scheme 33 When there are K>2 classes, target variables can be coded using the 1-of-K coding scheme:
Input from Class $C_i$ $\leftrightarrow$ $\mathbf{t} = [0, \ldots, 0, 1, 0, \ldots, 0]^t$, where the 1 is in element i.
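For illustration, a 1-of-K target matrix can be built as below. The function name `one_of_k` is hypothetical, and class labels are assumed to be integers counted from 0, so class index 2 sets the third element.

```python
import numpy as np

def one_of_k(labels, K):
    """Return an N x K matrix whose n-th row is the 1-of-K code of labels[n]."""
    labels = np.asarray(labels)
    T = np.zeros((labels.size, K))
    T[np.arange(labels.size), labels] = 1.0   # one 1 per row, in the class column
    return T

T = one_of_k([2, 0, 1], K=5)
```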

Computational Limitations of Perceptrons 34 Initially, the perceptron was thought to be a potentially powerful learning machine that could model human neural processing. However, Minsky & Papert (1969) showed that the single-layer perceptron could not learn a simple XOR function. This is just one example of a non-linearly separable pattern that cannot be learned by a single-layer perceptron.
[Figure: XOR pattern in the $(x_1, x_2)$ plane.]
[Photo: Marvin Minsky (1927-)]

Multi-Layer Perceptrons 35 Minsky & Papert's book was widely misinterpreted as showing that artificial neural networks were inherently limited. This contributed to a decline in the reputation of neural network research through the 70s and 80s. However, their findings apply only to single-layer perceptrons. Multilayer perceptrons are capable of learning highly nonlinear functions, and are used in many practical applications.

Dealing with Non-Linearly Separable Inputs 37 The perceptron algorithm fails when the training data are not perfectly linearly separable. Let's now turn to methods for learning the parameter vector w of a perceptron (linear classifier) even when the training data are not linearly separable.

The Least Squares Method 38 In the least squares method, we simply fit the (x, t) observations with a hyperplane y(x). Note that this is kind of a weird idea, since the t values are binary (when K=2), e.g., 0 or 1. However it can work pretty well.

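Least Squares: Learning the Parameters 39
Assume D-dimensional input vectors $\mathbf{x}$.
For each class $k \in 1 \ldots K$:
$$y_k(\mathbf{x}) = \mathbf{w}_k^t \mathbf{x} + w_{k0}$$
$$\rightarrow \mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^t \tilde{\mathbf{x}}$$
where
$\tilde{\mathbf{x}} = (1, \mathbf{x}^t)^t$
$\tilde{\mathbf{W}}$ is a $(D + 1) \times K$ matrix whose kth column is $\tilde{\mathbf{w}}_k = (w_{k0}, \mathbf{w}_k^t)^t$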

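Learning the Parameters 40
Method #2: Least Squares
$$\mathbf{y}(\mathbf{x}) = \tilde{\mathbf{W}}^t \tilde{\mathbf{x}}$$
Training dataset $(\mathbf{x}_n, \mathbf{t}_n)$, $n = 1, \ldots, N$, where we use the 1-of-K coding scheme for $\mathbf{t}_n$.
Let $\mathbf{T}$ be the $N \times K$ matrix whose nth row is $\mathbf{t}_n^t$.
Let $\tilde{\mathbf{X}}$ be the $N \times (D + 1)$ matrix whose nth row is $\tilde{\mathbf{x}}_n^t$.
Let $\mathbf{R}_D(\tilde{\mathbf{W}}) = \tilde{\mathbf{X}} \tilde{\mathbf{W}} - \mathbf{T}$.
Then we define the error as
$$E_D(\tilde{\mathbf{W}}) = \frac{1}{2} \sum_{i,j} R_{ij}^2 = \frac{1}{2} \mathrm{Tr}\left\{ \mathbf{R}_D(\tilde{\mathbf{W}})^t \mathbf{R}_D(\tilde{\mathbf{W}}) \right\}$$
Setting the derivative wrt $\tilde{\mathbf{W}}$ to 0 yields:
$$\tilde{\mathbf{W}} = (\tilde{\mathbf{X}}^t \tilde{\mathbf{X}})^{-1} \tilde{\mathbf{X}}^t \mathbf{T} = \tilde{\mathbf{X}}^{\dagger} \mathbf{T}$$
Recall: $\frac{\partial}{\partial \mathbf{A}} \mathrm{Tr}(\mathbf{A}\mathbf{B}) = \mathbf{B}^t$.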
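The closed-form solution can be sketched with NumPy's pseudo-inverse. The function names and the six-point, two-class toy set are assumptions made for illustration; classes are numbered from 0 and coded 1-of-K.

```python
import numpy as np

def add_bias_column(X):
    """Form X_tilde by prepending a column of ones to X."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def fit_least_squares(X, T):
    """Return W_tilde = pinv(X_tilde) @ T, a (D + 1) x K matrix."""
    return np.linalg.pinv(add_bias_column(X)) @ T

def predict_least_squares(W, X):
    """Assign each input to the class whose output y_k(x) is largest."""
    return np.argmax(add_bias_column(X) @ W, axis=1)

# Hypothetical toy data: two well-separated clusters in 2D.
X = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.6],
              [3.0, 3.0], [3.5, 2.8], [2.9, 3.4]])
T = np.array([[1, 0], [1, 0], [1, 0],
              [0, 1], [0, 1], [0, 1]], dtype=float)
W = fit_least_squares(X, T)
pred = predict_least_squares(W, X)
```

Using `np.linalg.pinv` rather than explicitly inverting the Gram matrix is a design choice here: it also returns a least-squares solution when that matrix is singular.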

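Fisher's Linear Discriminant 42 Another way to view linear discriminants: find the 1D subspace that maximizes the separation between the two classes.
Let
$$\mathbf{m}_1 = \frac{1}{N_1} \sum_{n \in C_1} \mathbf{x}_n, \qquad \mathbf{m}_2 = \frac{1}{N_2} \sum_{n \in C_2} \mathbf{x}_n$$
For example, we might choose $\mathbf{w}$ to maximize $\mathbf{w}^t (\mathbf{m}_2 - \mathbf{m}_1)$, subject to $\|\mathbf{w}\| = 1$.
This leads to $\mathbf{w} \propto \mathbf{m}_2 - \mathbf{m}_1$.
However, if the conditional distributions are not isotropic, this is typically not optimal.
[Figure: two-class example plot.]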

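Fisher's Linear Discriminant 43
Let $m_1 = \mathbf{w}^t \mathbf{m}_1$, $m_2 = \mathbf{w}^t \mathbf{m}_2$ be the conditional means on the 1D subspace.
Let $s_k^2 = \sum_{n \in C_k} (y_n - m_k)^2$ be the within-class variance on the subspace for class $C_k$.
The Fisher criterion is then
$$J(\mathbf{w}) = \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2}$$
This can be rewritten as
$$J(\mathbf{w}) = \frac{\mathbf{w}^t \mathbf{S}_B \mathbf{w}}{\mathbf{w}^t \mathbf{S}_W \mathbf{w}}$$
where
$$\mathbf{S}_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^t$$
is the between-class variance and
$$\mathbf{S}_W = \sum_{n \in C_1} (\mathbf{x}_n - \mathbf{m}_1)(\mathbf{x}_n - \mathbf{m}_1)^t + \sum_{n \in C_2} (\mathbf{x}_n - \mathbf{m}_2)(\mathbf{x}_n - \mathbf{m}_2)^t$$
is the within-class variance.
$J(\mathbf{w})$ is maximized for $\mathbf{w} \propto \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$.
[Figure: two-class example plot.]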
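The following sketch compares the Fisher direction with the simpler difference-of-means direction on synthetic data. The data (two Gaussian classes sharing a strongly non-isotropic covariance), the seed, and the function names are assumptions chosen for illustration; projections are taken as $y_n = \mathbf{w}^t \mathbf{x}_n$.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Return the unit vector w proportional to S_W^{-1} (m2 - m1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

def fisher_criterion(w, X1, X2):
    """J(w) = (m2 - m1)^2 / (s1^2 + s2^2), computed on the projections y = w^t x."""
    y1, y2 = X1 @ w, X2 @ w
    s1_sq = np.sum((y1 - y1.mean()) ** 2)
    s2_sq = np.sum((y2 - y2.mean()) ** 2)
    return (y2.mean() - y1.mean()) ** 2 / (s1_sq + s2_sq)

rng = np.random.default_rng(0)
cov = np.array([[3.0, 2.5], [2.5, 3.0]])   # assumed non-isotropic covariance
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X2 = rng.multivariate_normal([2.0, 0.0], cov, size=200)

w_fld = fisher_direction(X1, X2)
d = X2.mean(axis=0) - X1.mean(axis=0)
w_means = d / np.linalg.norm(d)            # difference-of-means direction
J_fld = fisher_criterion(w_fld, X1, X2)
J_means = fisher_criterion(w_means, X1, X2)
```

Since the Fisher direction maximizes J, `J_fld` can never fall below `J_means`; how large the gap is depends on how far the class covariance is from isotropic.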

Connection to MVN Maximum Likelihood 44
$J(\mathbf{w})$ is maximized for $\mathbf{w} \propto \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$.
Recall that if the two distributions are normal with the same covariance $\Sigma$, the maximum likelihood classifier is linear, with
$$\mathbf{w} \propto \Sigma^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$$
Further, note that $\mathbf{S}_W$ is proportional to the maximum likelihood estimator for $\Sigma$. Thus FLD is equivalent to assuming MVN distributions with common covariance.

Connection to Least-Squares 45
Change the coding scheme used in the least-squares method to
$t_n = \frac{N}{N_1}$ for $C_1$
$t_n = -\frac{N}{N_2}$ for $C_2$
Then one can show that the ML $\mathbf{w}$ satisfies
$$\mathbf{w} \propto \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1)$$

End of Lecture October 17, 2012

Problems with Least Squares 47 Problem #1: Sensitivity to outliers
[Figure: two example plots.]

Problems with Least Squares 48 Problem #2: Linear activation function is not a good fit to binary data. This can lead to problems.
[Figure: example plot.]

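Logistic Regression (K = 2) 50
$$p(C_1 \mid \boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^t \boldsymbol{\phi})$$
$$p(C_2 \mid \boldsymbol{\phi}) = 1 - p(C_1 \mid \boldsymbol{\phi})$$
where
$$\sigma(a) = \frac{1}{1 + \exp(-a)}$$
[Figure: $p(C_1 \mid \boldsymbol{\phi}) = \sigma(\mathbf{w}^t \boldsymbol{\phi})$ shown over the $(\phi_1, \phi_2)$ plane, with weight components $w_1$, $w_2$.]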

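Logistic Regression 51
$$p(C_1 \mid \boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^t \boldsymbol{\phi})$$
$$p(C_2 \mid \boldsymbol{\phi}) = 1 - p(C_1 \mid \boldsymbol{\phi})$$
where
$$\sigma(a) = \frac{1}{1 + \exp(-a)}$$
Number of parameters:
- Logistic regression: $M$
- Gaussian model: $2M + 2M(M+1)/2 + 1 = M^2 + 3M + 1$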

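ML for Logistic Regression 52
$$p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n}$$
where $\mathbf{t} = (t_1, \ldots, t_N)^t$ and $y_n = p(C_1 \mid \boldsymbol{\phi}_n)$.
We define the error function to be $E(\mathbf{w}) = -\log p(\mathbf{t} \mid \mathbf{w})$.
Given $y_n = \sigma(a_n)$ and $a_n = \mathbf{w}^t \boldsymbol{\phi}_n$, one can show that
$$\nabla E(\mathbf{w}) = \sum_{n=1}^{N} (y_n - t_n) \boldsymbol{\phi}_n$$
Unfortunately, there is no closed form solution for $\mathbf{w}$.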
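The gradient expression can be checked numerically against finite differences of the error function. The sketch below does this on random data; the data, the seed, the step size `eps`, and the function names are assumptions, and targets take values in {0, 1} as the likelihood above implies.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t):
    """E(w) = -log p(t | w) for targets t in {0, 1}; rows of Phi are phi_n^t."""
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def gradient(w, Phi, t):
    """Analytic gradient: sum_n (y_n - t_n) phi_n."""
    return Phi.T @ (sigmoid(Phi @ w) - t)

rng = np.random.default_rng(1)
Phi = rng.normal(size=(20, 3))             # hypothetical design matrix
t = (rng.random(20) < 0.5).astype(float)   # hypothetical binary targets
w = rng.normal(size=3)

eps = 1e-6                                 # central-difference step size
numeric = np.array([
    (cross_entropy(w + eps * e, Phi, t) - cross_entropy(w - eps * e, Phi, t)) / (2 * eps)
    for e in np.eye(3)
])
analytic = gradient(w, Phi, t)
```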

ML for Logistic Regression: Iterative Reweighted Least Squares 53
Although there is no closed form solution for the ML estimate of $\mathbf{w}$, fortunately, the error function is convex. Thus an appropriate iterative method is guaranteed to find the exact solution. A good method is to use a local quadratic approximation to the log likelihood function (Newton-Raphson update):
$$\mathbf{w}^{(new)} = \mathbf{w}^{(old)} - \mathbf{H}^{-1} \nabla E(\mathbf{w})$$
where $\mathbf{H}$ is the Hessian matrix of $E(\mathbf{w})$.

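The Hessian Matrix H 54
$$\mathbf{H} = \nabla_{\mathbf{w}} \nabla_{\mathbf{w}} E(\mathbf{w}), \quad \text{i.e.,} \quad H_{ij} = \frac{\partial^2 E(\mathbf{w})}{\partial w_i \, \partial w_j}$$
$H_{ij}$ describes how the ith component of the gradient varies as we move in the $w_j$ direction.
Let $\mathbf{u}$ be any unit vector. Then $\mathbf{H}\mathbf{u}$ describes the variation in the gradient as we move in the direction $\mathbf{u}$. $\mathbf{u}^t \mathbf{H} \mathbf{u}$ describes the projection of this variation onto $\mathbf{u}$. Thus $\mathbf{u}^t \mathbf{H} \mathbf{u}$ measures how much the gradient is changing in the $\mathbf{u}$ direction as we move in the $\mathbf{u}$ direction.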

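ML for Logistic Regression 55
$$\mathbf{w}^{(new)} = \mathbf{w}^{(old)} - \mathbf{H}^{-1} \nabla E(\mathbf{w})$$
where $\mathbf{H}$ is the Hessian matrix of $E(\mathbf{w})$:
$$\mathbf{H} = \boldsymbol{\Phi}^t \mathbf{R} \boldsymbol{\Phi}$$
where $\mathbf{R}$ is the $N \times N$ diagonal weight matrix with $R_{nn} = y_n (1 - y_n)$ and $\boldsymbol{\Phi}$ is the $N \times M$ design matrix whose nth row is given by $\boldsymbol{\phi}_n^t$.
(Note that, since $R_{nn} \ge 0$, $\mathbf{R}$ is positive semi-definite, and hence $\mathbf{H}$ is positive semi-definite. Thus $E(\mathbf{w})$ is convex.)
Thus
$$\mathbf{w}^{(new)} = \mathbf{w}^{(old)} - (\boldsymbol{\Phi}^t \mathbf{R} \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^t (\mathbf{y} - \mathbf{t})$$
See Problem 8.3 in the text!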
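A minimal sketch of the resulting iteration is given below. The starting point (all-zero weights), the iteration cap, the stopping tolerance, the function name `irls`, and the nine-point toy set are assumptions. The toy classes deliberately overlap, since on perfectly separable data the ML weights grow without bound.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def irls(Phi, t, n_iter=50, tol=1e-8):
    """Newton-Raphson updates w <- w - (Phi^t R Phi)^{-1} Phi^t (y - t)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        y = sigmoid(Phi @ w)
        R = y * (1 - y)                    # diagonal entries R_nn of the weight matrix
        H = Phi.T @ (Phi * R[:, None])     # Hessian Phi^t R Phi
        step = np.linalg.solve(H, Phi.T @ (y - t))
        w = w - step
        if np.linalg.norm(step) < tol:     # stop once the update is negligible
            break
    return w

# Hypothetical 1D data with overlapping classes; the first column of Phi is the constant 1.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, -1.0, 1.0])
t = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0])
Phi = np.column_stack([np.ones_like(x), x])
w_ml = irls(Phi, t)
grad_at_solution = Phi.T @ (sigmoid(Phi @ w_ml) - t)
```

At the returned weights the gradient of E(w) should be numerically zero, which is the condition the ML estimate has to satisfy.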

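ML for Logistic Regression 56 Iterative Reweighted Least Squares
$$p(C_1 \mid \boldsymbol{\phi}) = y(\boldsymbol{\phi}) = \sigma(\mathbf{w}^t \boldsymbol{\phi})$$
[Figure: $\sigma(\mathbf{w}^t \boldsymbol{\phi})$ with weight components $w_1$, $w_2$.]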

Logistic Regression 57 For K>2, we can generalize the activation function by modeling the posterior probabilities as
$$p(C_k \mid \boldsymbol{\phi}) = y_k(\boldsymbol{\phi}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$
where the activations $a_k$ are given by
$$a_k = \mathbf{w}_k^t \boldsymbol{\phi}$$
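The K-class posteriors can be computed as sketched below. The weights, the inputs, and the function name `softmax_posteriors` are hypothetical. Subtracting the row maximum before exponentiating is an implementation detail added here for numerical stability; it leaves the ratios unchanged.

```python
import numpy as np

def softmax_posteriors(W, Phi):
    """Row n holds p(C_k | phi_n) = exp(a_k) / sum_j exp(a_j), with a_k = w_k^t phi_n."""
    A = Phi @ W                            # activations, shape (N, K)
    A = A - A.max(axis=1, keepdims=True)   # shift each row; the ratios are unchanged
    expA = np.exp(A)
    return expA / expA.sum(axis=1, keepdims=True)

# Hypothetical weights for K = 3 classes; column k of W is w_k.
W = np.array([[1.0, 0.0, -1.0],
              [0.0, 2.0, 0.0]])
Phi = np.array([[1.0, 0.0], [0.0, 1.0], [1000.0, 0.0]])
P = softmax_posteriors(W, Phi)
```

The third input has a very large activation, which would overflow `np.exp` without the shift.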

Example 58
[Figure: two panels, "Least-Squares" and "Logistic".]
