Linear Methods for Classification

1 Linear Methods for Classification. Department of Statistics, The Pennsylvania State University.

2 Classification
Supervised learning. Training data: $\{(x_1, g_1), (x_2, g_2), \ldots, (x_N, g_N)\}$. The feature vector is $X = (X_1, X_2, \ldots, X_p)$, where each variable $X_j$ is quantitative. The response variable $G$ is categorical: $G \in \mathcal{G} = \{1, 2, \ldots, K\}$. The task is to form a predictor $G(x)$ to predict $G$ based on $X$. Example, email spam: $G$ has only two values, say 1 denoting a useful email and 2 denoting a junk email; $X$ is a 57-dimensional vector, each element being the relative frequency of a word or punctuation mark. $G(x)$ divides the input space (feature vector space) into a collection of regions, each labeled by one class.

3 [Figure: linear decision boundaries between classes, dividing the input space into labeled regions.]

4 Linear Methods
When the decision boundaries are linear, we speak of linear methods for classification. Two-class problem: the decision boundary between the two classes is a hyperplane in the feature vector space. A hyperplane in the $p$-dimensional input space is the set
$$\{x : \alpha_0 + \sum_{j=1}^{p} \alpha_j x_j = 0\}.$$
The two regions separated by a hyperplane are
$$\{x : \alpha_0 + \sum_{j=1}^{p} \alpha_j x_j > 0\} \quad \text{and} \quad \{x : \alpha_0 + \sum_{j=1}^{p} \alpha_j x_j < 0\}.$$
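A minimal numpy sketch of these two regions: the sign of $\alpha_0 + \sum_j \alpha_j x_j$ tells which side of the hyperplane a point falls on (all numbers below are made up for illustration).

```python
import numpy as np

alpha0 = -1.0                    # intercept alpha_0 (hypothetical value)
alpha = np.array([2.0, -0.5])    # coefficients alpha_1, ..., alpha_p (hypothetical)

x = np.array([1.0, 3.0])         # a point in the p = 2 dimensional input space
side = alpha0 + alpha @ x        # > 0 on one side, < 0 on the other, = 0 on the boundary
print("class 1" if side > 0 else "class 2")
```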

5 More than two classes: the decision boundary between any pair of classes $k$ and $l$ is a hyperplane (shown in the previous figure). Question: which hyperplanes to use? Different criteria lead to different algorithms: linear regression of an indicator matrix, linear discriminant analysis, logistic regression, and Rosenblatt's perceptron learning algorithm. Note that linear decision boundaries are not as constraining as they appear: enlarging the feature space with transformed variables (e.g., squares and cross-products) yields boundaries that are nonlinear in the original inputs while the method remains linear.

6 The Bayes Classification Rule
Suppose the marginal distribution of $G$ is specified by the pmf $p_G(g)$, $g = 1, 2, \ldots, K$, and the conditional distribution of $X$ given $G = g$ is $f_{X|G}(x \mid G = g)$. The training data $(x_i, g_i)$, $i = 1, 2, \ldots, N$, are independent samples from the joint distribution of $X$ and $G$,
$$f_{X,G}(x, g) = p_G(g) \, f_{X|G}(x \mid G = g).$$
Assume the loss of predicting $G$ as $\hat{G}(X)$ is $L(\hat{G}(X), G)$. The goal of classification is to minimize the expected loss
$$E_{X,G} \, L(\hat{G}(X), G) = E_X \big( E_{G|X} \, L(\hat{G}(X), G) \big).$$

7 To minimize the left-hand side, it suffices to minimize the inner expectation $E_{G|X=x} L(g, G)$ for each $x$. Hence the optimal predictor is the Bayes classification rule:
$$\hat{G}(x) = \arg\min_g E_{G|X=x} \, L(g, G).$$

8 For 0-1 loss, i.e.,
$$L(g, g') = \begin{cases} 1 & g \neq g' \\ 0 & g = g', \end{cases}$$
we have $E_{G|X=x} L(g, G) = 1 - \Pr(G = g \mid X = x)$. The Bayes rule becomes the rule of maximum a posteriori probability:
$$\hat{G}(x) = \arg\min_g E_{G|X=x} \, L(g, G) = \arg\max_g \Pr(G = g \mid X = x).$$
Many classification algorithms attempt to estimate $\Pr(G = g \mid X = x)$ and then apply the Bayes rule.
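As a sketch of the MAP rule, suppose the class-conditional densities $f_{X|G}$ are one-dimensional Gaussians with assumed (made-up) parameters; the posterior follows from Bayes' theorem, and the rule picks its argmax.

```python
import numpy as np

priors = np.array([0.6, 0.4])     # p_G(g), g = 1, 2 (assumed values)
means = np.array([0.0, 2.0])      # class-conditional means (assumed)
sds = np.array([1.0, 1.0])        # class-conditional std devs (assumed)

def posterior(x):
    # p_G(g) * f_{X|G}(x | g), then normalize to get Pr(G = g | X = x)
    joint = priors * np.exp(-0.5 * ((x - means) / sds) ** 2) / (sds * np.sqrt(2 * np.pi))
    return joint / joint.sum()

post = posterior(1.2)
print(post, "predicted class:", post.argmax() + 1)   # MAP = Bayes rule under 0-1 loss
```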

9 Linear Regression of an Indicator Matrix
If $G$ has $K$ classes, there will be $K$ class indicators $Y_k$, $k = 1, \ldots, K$, with $Y_k = 1$ if $G = k$ and $Y_k = 0$ otherwise. For example, with $K = 3$:

g   y1  y2  y3
1   1   0   0
2   0   1   0
3   0   0   1

Fit a linear regression model for each $Y_k$, $k = 1, 2, \ldots, K$, using $X$:
$$\hat{y}_k = X(X^T X)^{-1} X^T y_k.$$
Define $Y = (y_1, y_2, \ldots, y_K)$; then
$$\hat{Y} = X(X^T X)^{-1} X^T Y.$$
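A small self-contained sketch of the indicator coding and the least-squares fit, on synthetic data (the labels and inputs below are simulated, not the slides' data):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, K = 100, 2, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])   # design matrix with intercept
g = rng.integers(1, K + 1, size=N)                           # labels in {1, ..., K}

Y = (g[:, None] == np.arange(1, K + 1)).astype(float)        # N x K indicator matrix
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)                # solves (X^T X)^{-1} X^T Y
Y_hat = X @ B_hat                                            # fitted indicator responses
```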

10 Classification Procedure
Define $\hat{B} = (X^T X)^{-1} X^T Y$. For a new observation with input $x$, compute the fitted output
$$\hat{f}(x) = [(1, x)\hat{B}]^T = [(1, x_1, x_2, \ldots, x_p)\hat{B}]^T = \big(\hat{f}_1(x), \hat{f}_2(x), \ldots, \hat{f}_K(x)\big)^T.$$
Identify the largest component of $\hat{f}(x)$ and classify accordingly:
$$\hat{G}(x) = \arg\max_{k \in \mathcal{G}} \hat{f}_k(x).$$
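Continuing the sketch above (reusing the fitted `B_hat`), a new observation is classified by the largest fitted component:

```python
import numpy as np

# B_hat is the (p + 1) x K coefficient matrix fitted in the previous sketch.
x_new = np.array([0.5, -1.0])                     # a hypothetical new input, p = 2
f_hat = np.concatenate(([1.0], x_new)) @ B_hat    # f_hat(x) = [(1, x) B_hat]^T
G_hat = f_hat.argmax() + 1                        # arg max over k of f_hat_k(x)
print(f_hat, "classified as", G_hat)
```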

11 Rationale
The linear regression of $Y_k$ on $X$ is a linear approximation to $E(Y_k \mid X = x)$. Note that
$$E(Y_k \mid X = x) = \Pr(Y_k = 1 \mid X = x) \cdot 1 + \Pr(Y_k = 0 \mid X = x) \cdot 0 = \Pr(Y_k = 1 \mid X = x) = \Pr(G = k \mid X = x).$$
According to the Bayes rule, the optimal classifier is
$$G^*(x) = \arg\max_{k \in \mathcal{G}} \Pr(G = k \mid X = x).$$
Linear regression of an indicator matrix thus approximates $\Pr(G = k \mid X = x)$ by a linear function of $x$ and applies the Bayes rule to the approximated probabilities.

12 Example: Diabetes Data
The diabetes data set is taken from the UCI machine learning database repository (mlearn/machine-learning.html). The original source of the data is the National Institute of Diabetes and Digestive and Kidney Diseases. There are 768 cases in the data set, of which 268 show signs of diabetes according to World Health Organization criteria. Each case contains 8 quantitative variables, including diastolic blood pressure, triceps skin fold thickness, body mass index, etc. There are two classes: with or without signs of diabetes. Denote the 8 original variables by $X_1, X_2, \ldots, X_8$. Remove the mean of each $X_j$ and normalize it to unit variance.
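The standardization step is just a column-wise centering and scaling; a sketch, with a random placeholder standing in for the 768 x 8 diabetes matrix:

```python
import numpy as np

X_raw = np.random.default_rng(1).normal(size=(768, 8))    # placeholder, not the real data
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)  # zero mean, unit variance per column
```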

13 The first two principal components of the standardized variables, denoted $\tilde{X}_1$ and $\tilde{X}_2$, are used in classification. Each is a linear combination of the eight standardized inputs,
$$\tilde{X}_1 = \sum_{j=1}^{8} a_{1j} X_j, \qquad \tilde{X}_2 = \sum_{j=1}^{8} a_{2j} X_j,$$
where the loadings $a_{kj}$ are the first two principal component directions.
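The principal component directions can be obtained from the SVD of the standardized matrix; continuing the sketch above (reusing `X_std`), the first two score vectors play the role of $\tilde{X}_1$ and $\tilde{X}_2$:

```python
import numpy as np

# Rows of Vt are the principal component directions (the loadings a_kj above).
U, s, Vt = np.linalg.svd(X_std, full_matrices=False)
scores = X_std @ Vt[:2].T        # N x 2 matrix: the first two principal components
```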

14 The scatter plot follows. Without diabetes: stars (class 1); with diabetes: circles (class 2).

15 Fitting $\hat{B} = (X^T X)^{-1} X^T Y$ on the training data yields fitted responses that are linear in the two components,
$$\hat{Y}_1 = \hat{b}_{10} + \hat{b}_{11} \tilde{X}_1 + \hat{b}_{12} \tilde{X}_2, \qquad \hat{Y}_2 = \hat{b}_{20} + \hat{b}_{21} \tilde{X}_1 + \hat{b}_{22} \tilde{X}_2.$$
Note that $\hat{Y}_1 + \hat{Y}_2 = 1$.

16 Classification rule: since $\hat{Y}_1 + \hat{Y}_2 = 1$,
$$\hat{G}(x) = \begin{cases} 1 & \hat{Y}_1 \geq \hat{Y}_2 \\ 2 & \hat{Y}_1 < \hat{Y}_2 \end{cases} \;=\; \begin{cases} 1 & \hat{Y}_1 \geq 0.5 \\ 2 & \text{otherwise.} \end{cases}$$
Within the training data: classification error rate 28.52%; sensitivity (probability of claiming positive when the truth is positive) 44.03%; specificity (probability of claiming negative when the truth is negative) 86.20%.
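A sketch of the three reported quantities, given arrays of true labels `g` and predictions `g_hat` coded as 1 (no diabetes, negative) and 2 (diabetes, positive):

```python
import numpy as np

def error_metrics(g, g_hat):
    error = np.mean(g_hat != g)                  # training classification error rate
    sensitivity = np.mean(g_hat[g == 2] == 2)    # Pr(claim positive | truly positive)
    specificity = np.mean(g_hat[g == 1] == 1)    # Pr(claim negative | truly negative)
    return error, sensitivity, specificity
```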

17 [Figure: the linear decision boundary overlaid on the scatter plot of the two principal components.]

18 The Phenomenon of Masking
When the number of classes is $K \geq 3$, a class may be masked by the others; that is, no region of the feature space is labeled with that class. The linear regression model is too rigid: the fitted linear function for a class lying between the others can be dominated everywhere by the fitted functions of the outer classes.
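Masking is easy to reproduce in simulation. In the standard illustration (an assumption here, not the slides' own example), three classes sit along one direction; the fitted linear function for the middle class is nearly flat and is dominated everywhere, so the middle label is essentially never predicted:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(m, 0.5, 200) for m in (-4.0, 0.0, 4.0)])
g = np.repeat([1, 2, 3], 200)                            # class 2 is the middle class

X = np.column_stack([np.ones_like(x), x])                # linear design, one input
Y = (g[:, None] == np.array([1, 2, 3])).astype(float)    # indicator matrix
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
predicted = (X @ B).argmax(axis=1) + 1
print(np.unique(predicted))                              # typically [1 3]: class 2 is masked
```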

19 [Figure: illustration of the masking phenomenon.]

20 [Figure: illustration of the masking phenomenon, continued.]
