Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396

Size: px

Start display at page:

Download "Data Mining. Linear & nonlinear classifiers. Hamid Beigy. Sharif University of Technology. Fall 1396"

Florence Atkins
5 years ago
Views:

1 Data Mining Linear & nonlinear classifiers Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

2 Table of contents 1 Introduction 2 Linear discriminant analysis 3 Linear classifiers 4 Support vector machines 5 Non-linear support vector machine 6 Multi-class Classifiers One-against-all classification One-against-one classification Error correcting coding classification Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

3 Table of contents 1 Introduction 2 Linear discriminant analysis 3 Linear classifiers 4 Support vector machines 5 Non-linear support vector machine 6 Multi-class Classifiers One-against-all classification One-against-one classification Error correcting coding classification Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

Each training input x is a D dimensional vector of numbers. Approaches for building a classifier.

4 Introduction In classification, the goal is to find a mapping from inputs X to outputs t {1, 2,..., C} given a labeled set of input-output pairs (training set) S = {(x 1, t 1 ), (x 2, t 2 ),..., (x N, t N )}. Each training input x is a D dimensional vector of numbers. Approaches for building a classifier. Generative approach: This approach first creates a joint model of the form of p(x, C n ) and then to condition on x, then deriving p(c n x). Discriminative approach: This approach creates a model of the form of p(c n x) directly. Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

5 Table of contents 1 Introduction 2 Linear discriminant analysis 3 Linear classifiers 4 Support vector machines 5 Non-linear support vector machine 6 Multi-class Classifiers One-against-all classification One-against-one classification Error correcting coding classification Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

6 of the data space. The centered data matrix is obtained by subtracting the mean from Linear discriminant all the points: analysis (LDA) Z = D 1 µ T = One way to view a linear classification model. is. in terms. of dimensionality. reduction. x Assume that we want to project a vector T n µ onto T x another T n µt z vector T n to obtain a new point after where z a change of the basis vectors. i = x i µ represents the centered point corresponding to x i,and1 R n is the Let a, b R n n-dimensional vector all of whose elements have value 1. The mean of the centered be data twomatrix n-dimensional Z is 0 R d,becausewehavesubtractedthemeanµ vectors. from all the points x i. An orthogonal decomposition of the vector b in the direction of another vector a equals to Orthogonal Projection where b = b is parallel to a and r = b is perpendicular to a. x T 1 x T 2 µ T µ T x T 1 µt = x T 2 µt = z T 1 z T 2 (1.5) Often in data mining we need to project a point or vector onto another vector, for example, to obtain a new point b = after b a+ change b = of the p basis + r vectors. Let a,b R m be two m-dimensional vectors. An orthogonal decomposition of the vector b in the direction X 2 4 b 3 r = b a p = b X 1 Figure 1.4. Orthogonal projection. Vector p is called orthogonal projection or projection of b on vector a. Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

7 Linear discriminant Orthogonal analysis Projection (LDA) Often in data mining we need to project a point or vector onto another vector, for example, to obtain a new point after a change of the basis vectors. Let a,b R m be two m-dimensional vectors. An orthogonal decomposition of the vector b in the direction p can be written as p = ca, where c is scaler and p is parallel to a. X 2 4 b 3 r = b a p = b X 1 Figure 1.4. Orthogonal projection. Thus r = b p = b ca. Since p and r are orthogonal, we have This implies Therefor, the projection of b on a equals to p T r = (ca) T (b ca) = ca T b c 2 a T a = 0. c = at b a T a p = b = ca = ( a T ) b a T a a Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

8 Linear discriminant analysis (LDA) Consider a two-class problem and suppose we take a D dimensional input vector x and project it down to one dimension using z = W T x If we place a threshold on z and classify z w 0 as class C 1, and otherwise class C 2, then 500 Linear Discriminant Analysis we obtain our standard linear classifier Figure Linear discriminant direction w. Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31 w

9 Linear discriminant analysis (cont.) Consider a two-class problem in which there are N 1 points of class C 1 and N 2 points of class C 2. Hence the mean vectors of the class C j is given by µ j = 1 N j i C j x i The simplest measure of the separation of the classes, when projected onto W, is the separation of the projected class means. This suggests that we might choose W so as to maximize where m 2 m 1 = W T (µ 2 µ 1 ) m j = W T µ j This expression can be made arbitrarily large simply by increasing the magnitude of W. To solve this problem, we could constrain W to have unit length, so that i w 2 i = 1. Using a Lagrange multiplier to perform the constrained maximization, we then find that W (µ 2 µ 1 ) Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

10 Linear discriminant analysis (cont.) This approach has a problem: The following figure shows two classes that are well ASSIFICATION separated in the original two dimensional space but that have considerable overlap when projected onto the line joining their means wo classes (depicted in red and blue) along with the histograms e class means. Note that there is considerable class overlap in rresponding projection based on the Fisher linear discriminant, This difficulty arises from the strongly non-diagonal covariances of the class distributions. The idea proposed by Fisher is to maximize a function that will give a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap. ected data from class C k. However, this expression can be simply by increasing the magnitude of w. To solve this strain w to have unit length, so that wi 2 = 1. Using Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

11 Linear discriminant analysis (cont.) The idea proposed by Fisher is to maximize a function that will give a large separation between the projected class means while also giving a small variance within each class, thereby minimizing the class overlap. The projection z = W T x transforms the set of labeled data points in x into a labeled set in the one-dimensional space z. The within-class variance of the transformed data from class C j equals s 2 j = i C j (z i m j ) 2 where z i = w T x i. We can define the total within-class variance for the whole data set to be s s2 2. The Fisher criterion is defined to be the ratio of the between-class variance to the within-class variance and is given by J(W ) = (m 2 m 1 ) 2 s s2 2 Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

12 Linear discriminant analysis (cont.) Between-class covariance matrix equals to S B = (µ 2 µ 1 )(µ 2 µ 1 ) T Total within-class covariance matrix equals to S W = (x i µ 1 ) (x i µ 1 ) T + (x i µ 2 ) (x i µ 2 ) T i C 1 i C 2 We have (m 1 m 2 ) 2 = (W T µ 1 W T µ 2 ) 2 = W T (µ 1 µ 2 )(µ 1 µ 2 ) T W }{{} S B = W T S B W, Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

13 Linear discriminant analysis (cont.) Also we have s 2 1 = i ( W T x i µ 2 ) 2 ti = i W T (x i µ 1 )(x i µ 1 ) 2 Wt i [ ] = W T ((x i µ 1 )(x i µ 1 ) 2 W i }{{} S 1 = W T S 1 W, and S W = S 1 + S 2. Hence, J(W ) can be written as J(w) = W T S B W W T S W W Derivative of J(W ) with respect to W equals to (using xt Ax x W S 1 W (µ 2 µ 1 ) = (A + A T )x) The result W S 1 W (µ 2 µ 1 ) is known as Fisher s linear discriminant, although strictly it is not a discriminant but rather a specific choice of direction for projection of the data down to one dimension. Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

14 Table of contents 1 Introduction 2 Linear discriminant analysis 3 Linear classifiers 4 Support vector machines 5 Non-linear support vector machine 6 Multi-class Classifiers One-against-all classification One-against-one classification Error correcting coding classification Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

15 Linear classifiers We consider the following type of linear classifiers. y(x n ) = g(x n ) = sign(w 1 x n1 + w 2 x n w D x nd ) { 1, +1} D ) = sign w n x nj = sign (w T x n. j=1 w = (w 1, w 2,..., w D ) T R D. Different value of w give different functions. x n = (x n1, x n2,..., x nd ) T is a column vector of real values. This classifier changes its prediction only when the argument to the sign function changes from positive to negative (or vice versa). Geometrically,this transition in the feature space corresponds to crossing the decision LINEAR MODELS FOR CLASSIFICATION boundary where the argument is exactly zero: all x such that w T x = Hamid Beigy (Sharif University of Technology) Figure 4.4 The left plot shows data from Data two classes, Mining denoted by red crosses and blue circles, together with Fall / 31

16 Table of contents 1 Introduction 2 Linear discriminant analysis 3 Linear classifiers 4 Support vector machines 5 Non-linear support vector machine 6 Multi-class Classifiers One-against-all classification One-against-one classification Error correcting coding classification Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

17 Support vector machines Consider the problem of finding a separating hyperplane for a linearly separable dataset S = {(x 1, t 1 ), (x 2, t 2 ),..., (x N, t N )} with x i R D and t i { 1, +1}. Which of the infinite hyperplanes should we choose? Hyperplanes that pass too close to the training examples will be sensitive to noise and, therefore, less likely to generalize well for data outside the training set. It is reasonable to expect that a hyperplane that is farthest from all training examples will have better generalization capabilities. We can find the maximum margin linear classifier by first identifying a classifier that correctly classifies all the examples and then increasing the geometric margin until we cannot increase the margin any further. We can also set up an optimization problem for directly maximizing the geometric margin. Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

18 Support vector machines (cont.) We will need the classifier to be correct on all the training examples (t n w T x n +1 for all n = 1, 2,..., N) subject to these constraints, we would like to maximize the geometric margin ( +1 w ). Hence, we have Maximize 1 w subject to t n w T x n 1 for all n = 1, 2,..., N We can alternatively minimize the inverse w or the inverse squared w 2 subject to the same constraints. Minimize 1 2 w 2 subject to t n w T x n 1 for all n = 1, 2,..., N Factor 1 2 is included merely for later convenience. The above problem can be written as Minimize 1 2 w 2 subject to t n w T x n 1 for all n = 1, 2,..., N Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

19 Support vector machines (cont.) The SVM optimization problem can be written as Minimize 1 2 w 2 subject to t n w T x n 1 for all n = 1, 2,..., N This optimization problem is in the standard SVM form and is a quadratic programming problem. We will modify the linear classifier here slightly by adding an offset term so that the decision boundary does not have to go through the origin. In other words, the classifier that we consider has the form g(x) = w T x + b w is the weight vector b is the bias of the separating hyperplane. The hyperplane is shown by (w, b). The bias parameter changes the optimization problem to Minimize 1 2 w 2 subject to t n ( w T x n + b ) 1 for all n = 1, 2,..., N Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

20 Support vector machines (cont.) The optimization problem for SVM is defined as Minimize 1 2 w 2 subject to t n ( w T x n + b ) 1 for all n = 1, 2,..., N In order to solve this constrained optimization problem, we introduce Lagrange multipliers α n 0, with one multiplier α n for each of the constraints giving the Lagrangian function L(w, b, α) = 1 N [ ) ] 2 w 2 α n t n (w T x n + b 1 n=1 where α = (α 1, α 2,..., α N ) T. Note the minus sign in front of the Lagrange multiplier term, because we are minimizing with respect to w and b, and maximizing with respect to α. Setting the derivatives of L(w, b, α) with respect to w and b equal to zero, we obtain the following two equations L N w = 0 w = α n t n x n n=1 L N b = 0 0 = α n t n Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31 n=1

21 Support vector machines (cont.) L has to be minimized with respect to the primal variables w and b and maximized with respect to the dual variables α n. Eliminating w and b from L(w, b, a) using these conditions then gives the dual representation of the problem in which we maximize N L(α) = α n 1 N N α n α m t n t m xn T x m 2 n=1 n=1 m=1 We need to maximize L(α) subject to the following constraints N α n 0 n and α n t n = 0 The constrained optimization of this form satisfies the Karush-Kuhn-Tucker (KKT) conditions, which in this case require that the following three properties hold n=1 α n 0 t n g(x n ) 1 α n [t n g(x n ) 1] = 0 To classify a data x using the trained model, we evaluate the sign of g(x) defined by N g(x) = α n t n xn T x n=1 Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

22 Support vector machines (cont.) We have assumed that the training data are linearly separable in the feature space. The resulting SVM will give exact separation of the training data. In the practice, the class-conditional distributions may overlap, in which the exact separation of the training data can lead to poor generalization. We need a way to modify the SVM so as to allow some training examples to be miss-classified. To do this, we introduce slack variables (ξ n 0); one slack variable for each training example. The slack variables are defined by ξ n = 0 for examples that are inside the correct boundary margin and ξ n = t n g(x n ) for other examples. Thus for data point that is on the decision boundary g(x n ) = 0 will have ξ n = 1 and the data points with ξ n 1 will be misclassified. Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

23 Support vector machines (cont.) The exact classification constraints will be t n g(x n ) 1 ξ n for n = 1, 2,..., N Our goal is now to maximize the margin while softly penalizing points that lie on the wrong side of the margin boundary. We therefore minimize C N ξ n w 2 n=1 C > 0 controls the trade-off between the slack variable penalty and the margin. We now wish to solve the following optimization problem. Minimize 1 2 w 2 + C N n=1 ξ n The corresponding Lagrangian is given L(w, b, α) = 1 2 w 2 + C subject to t n g(x n ) 1 ξ n for all n = 1, 2,..., N N ξ n n=1 N α n [t n g(x n ) 1 + ξ n ] n=1 where α n 0 and β n 0 are Lagrange multipliers. N β n ξ n Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31 n=1

24 Table of contents 1 Introduction 2 Linear discriminant analysis 3 Linear classifiers 4 Support vector machines 5 Non-linear support vector machine 6 Multi-class Classifiers One-against-all classification One-against-one classification Error correcting coding classification Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

25 Non-linear support vector machine Most data sets are not linearly separable, for example Instances that are not linearly separable in 1 dimension, may be linearly separable in 2 dimensions, for example In this case, we have two solutions Increase dimensionality of data set by introducing mapping φ. Use a more complex model for classifier. Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

26 Non-linear support vector machine (cont.) To solve the non-linearly separable dataset, we use mapping φ. For example, let x = (x 1, x 2 ) T, z = (z 1, z 2.z 3 ) T, and φ : R 2 R 3. If we use mapping z = φ(x) = (x 2 1, 2x 1 x 2, x 2 2 )T, the dataset will be linearly separable in R 3. Mapping dataset to higher dimensions has two major problems. In high dimensions, there is risk of over-fitting. In high dimensions, we have more computational cost. The generalization capability in higher dimension is ensured by using large margin classifiers. The mapping is an implicit mapping not explicit. Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

27 Non-linear support vector machine (cont.) The SVM uses the following discriminant function. N g(x) = α n t n xn T x n=1 This solution depends on the dot-product between two pints x i and x j. The operation in high dimensional space φ(x) don t have performed explicitly if we can find a function K(x i, x j ) such that K(x i, x j ) = φ(x i ) T φ(x j ). K(x i, x j ) is called kernel in the SVM. Suppose x, z R D and consider the following kernel ( ) 2 K(x, z) = x T z It is a valid kernel because K(x, z) = = ( D i=1 D i=1 j=1 x i z i ) D x j z j j=1 D (x i x j ) (z i z j ) = φ(x) T φ(z) where the mapping φ for D = 2 is : φ(x) = (x 1 x 1, x 1 x 2, x 2 x 1, x 2 x 2 ) T Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

28 Non-linear support vector machine (cont.) Show that kernel K(x, z) = (x T z + c) 2 is a valid kernel. A kernel K is valid if there is some mapping φ such that K(x, z) = φ(x) T φ(z). Assume that K is a valid kernel. Consider a set of N points, K is N N square matrix defined as K = k(x 1, x 1 ) k(x 1, x 2 ) k(x 1, x N ) k(x 2, x 1 ) k(x 2, x 2 ) k(x 2, x N ) k(x N, x 1 ) k(x N, x 2 ) k(x N, x N ) K is called kernel matrix. If K is a valid kernel then k ij = k(x i, x j ) = φ(x i ) T φ(x j ) = φ(x j ) T φ(x i ) = k(x j, x i ) = k ji Thus K is symmetric. It can also be shown that K is positive semi-definite (show it.). Thus if K is a valid kernel, then the corresponding kernel matrix is symmetric positive semi-definite. It is both necessary and sufficient conditions for K to be a valid kernel. Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

29 Non-linear support vector machine (cont.) Theorem (Mercer) Assume that K : R D R D R. Then for K to be a valid (Mercer) kernel, it is necessary and sufficient that for any {x 1, x 2,..., x N }, (N > 1) the corresponding kernel matrix is symmetric positive semi-definite. Some valid kernel functions Polynomial kernels K(x, z) = (x T z + 1) p p is the degree of the polynomial and specified by the user. Radial basis function kernels ) x z 2 K(x, z) = exp ( 2σ 2 The width σ is specified by the user. This kernel corresponds to an infinite dimensional mapping φ. Sigmoid Kernel K(x, z) = tanh ( β 0 x T z + β 1 ) This kernel only meets Mercers condition for certain values of β 0 and β 1. Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

30 Table of contents 1 Introduction 2 Linear discriminant analysis 3 Linear classifiers 4 Support vector machines 5 Non-linear support vector machine 6 Multi-class Classifiers One-against-all classification One-against-one classification Error correcting coding classification Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

31 Multi-class Classifiers In classification, the goal is to find a mapping from inputs X to outputs t {1, 2,..., C} given a labeled set of input-output pairs. In all our discussions, so far,we have been involved with the two-class classification task, i.e. C = 2. We can extend the binary classifiers to C class classification problems or use the binary classifiers. In an C-class problem, we have the following three extensions for using binary classifiers. One-against-all: This approach is a straightforward extension of two-class problem and considers it as a of C two-class problems. One-against-one: In this approach,c(c 1)/2 binary classifiers are trained and each classifier separates a pair of classes. The decision is made on the basis of a majority vote. Error correcting coding: For a C class problem a number of L binary classifiers are used,where L is appropriately chosen by the designer. Each class is now represented by a binary code word of length L. Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

32 One-against-all classification The extension is to consider a set of C two-class problems. For each class, we seek to design an optimal discriminant function, g i (x) (for i = 1, 2,..., C) so that g i (x) > g j (x), j i, if x C i. Adopting the SVM methodology, we can design the discriminant functions so that g i (x) = 0 to be the optimal hyperplane separating class C i from all the others. Thus, each classifier is designed to give g i (x) > 0 for x C i and g i (x) < 0 otherwise. Classification is then achieved according to the following rule: Assign x to class C i if i = argmaxg k (x) k Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

33 Properties of one-against-all classification The number of classifiers equals to C. Each binary classifier deals with a rather asymmetric problem in the sense that training is carried out with many more negative than positive examples. This becomes more serious when the number of classes is relatively large. This technique, however,may lead to indeterminate regions, where more than one g i (x) is positive Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

34 One-against-one classification In this case,c(c 1)/2 binary classifiers are trained and each classifier separates a pair of classes. The decision is made on the basis of a majority vote. The obvious disadvantage of the technique is that a relatively large number of binary classifiers has to be trained. Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

35 Error correcting coding classification In this approach, the classification task is treated in the context of error correcting coding. For a C class problem a number of, say, L binary classifiers are used,where L is appropriately chosen by the designer. Each class is now represented by a binary code word of length L. During training, for the i th classifier, i = 1, 2,..., L, the desired labels, t, for each class are chosen to be either 1 or +1. For each class, the desired labels may be different for the various classifiers. This is equivalent to constructing a matrix C L of desired labels. For example, if C = 4 and L = 6, such a matrix can be Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

36 Error correcting coding classification (cont.) For example, if C = 4 and L = 6, such a matrix can be During training, the first classifier (corresponding to the first column of the previous matrix) is designed in order to respond ( 1, +1, +1, 1) for examples of classes C 1, C 2, C 3, C 4, respectively. The second classifier will be trained to respond ( 1, 1, +1, 1), and so on. The procedure is equivalent to grouping the classes into L different pairs, and, for each pair, we train a binary classifier accordingly. Each row must be distinct and corresponds to a class. Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

37 Error correcting coding classification (cont.) When an unknown pattern is presented, the output of each one of the binary classifiers is recorded, resulting in a code word. Then,the Hamming distance (number of places where two code words differ) of this code word is measured against the C code words, and the pattern is classified to the class corresponding to the smallest distance. This feature is the power of this technique. If the code words are designed so that the minimum Hamming distance between any pair of them is, say, d, then a correct decision will still be reached even if the decisions of at most d 1 2 out of the L, classifiers are wrong. Hamid Beigy (Sharif University of Technology) Data Mining Fall / 31

Linear & nonlinear classifiers

Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table