Linear Decision Boundaries


Mabel Perkins
6 years ago

A basic approach to classification is to find a decision boundary in the space of the predictor variables. The decision boundary is often a curve formed by a regression model,

$$y_i = f(x_i) + \epsilon_i,$$

which we often take as linear:

$$y_i = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \epsilon_i, \qquad \text{that is,} \qquad y_i \approx \beta_0 + \beta^T x_i.$$

We often denote the decision function for the $k$th class as $\delta_k(x)$. It is in the context of such regression models that we considered the problem of building the model by variable selection, regularization, and derived input directions.

Decision Boundaries from Regression Models

How do we use the regression model to form a decision boundary? If we have $K$ classes, the basic idea is to fit the model $y_i \approx \beta_0 + \beta^T x_i$ separately in each of the classes; that is, we have

$$\hat{y} = \hat{f}_k(x) = \hat\beta_{k0} + \hat\beta_k^T x, \qquad k = 1, \ldots, K.$$

We use the same predictor variables in all classes. The decision boundary between classes $j$ and $k$ is the set of points where $\hat{f}_j(x) = \hat{f}_k(x)$. This is a hyperplane:

$$(\hat\beta_{j0} - \hat\beta_{k0}) + (\hat\beta_j - \hat\beta_k)^T x = 0.$$
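As a small illustration of the hyperplane above, the following sketch takes two made-up coefficient vectors for $p = 2$ and checks that the two fitted functions agree at a point on the boundary. The coefficient values and the helper names `f_hat` and `boundary_x2` are hypothetical, chosen only for this example.

```r
# Hypothetical fitted coefficients (intercept first) for classes j and k, p = 2.
beta_j <- c(0.2, 0.5, -0.3)
beta_k <- c(0.6, -0.1, 0.4)

# Linear fit for one class at a point x (x does not include the leading 1).
f_hat <- function(beta, x) beta[1] + sum(beta[-1] * x)

# The boundary is delta[1] + delta[2] * x1 + delta[3] * x2 = 0.
delta <- beta_j - beta_k

# Solve the boundary equation for x2 given x1 (requires delta[3] != 0).
boundary_x2 <- function(x1) -(delta[1] + delta[2] * x1) / delta[3]

# A point on the boundary: the two fitted values agree there.
x_on <- c(1.5, boundary_x2(1.5))
c(f_hat(beta_j, x_on), f_hat(beta_k, x_on))
```

For $p = 2$ the hyperplane is a line in the $(x_1, x_2)$ plane, which is why it can be solved for $x_2$ directly.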

Decision Boundaries from Regression Models

Instead of the linear regression model, we could use a generalized linear model. The simplest one of these is for the situation where $K = 2$. This leads to logistic regression, where we begin with the probability of being in one class, then form the odds, and then model the log odds. (This is called the logit transformation of the probability.)

(Note that in the general classification problem, we often develop methods for $K = 2$, and use them for $K > 2$ by sequentially forming two groups consisting of one class against all of the remaining ones.)

Decision Boundaries from Regression Models for Indicator Variables

We form an indicator matrix, $Y$, in which the columns are associated with the classes, and for a given observation, we represent its class by a 1 in the appropriate column and 0s in all of the other columns. The linear regression model for each column of $Y$ is $y = X\beta$ (where $X$ contains a column of 1s), but we can put them all together as $Y = XB$, where $Y$ is $N \times K$, $X$ is $N \times (p + 1)$, and $B$ is $(p + 1) \times K$. Fitting this multivariate multiple linear regression model by least squares is done in exactly the same way as for a univariate regression model.
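One way to build the indicator matrix from a vector of class labels is sketched below. The label vector `g` is a made-up example; `outer` with `"=="` is just one convenient construction, not necessarily the one used in class.

```r
# Class labels for a small hypothetical sample with K = 3 classes.
g <- c(1, 3, 2, 2, 1, 3)
K <- 3

# N x K indicator matrix: entry (i, k) is 1 when observation i is in class k.
Y <- outer(g, seq_len(K), "==") * 1
Y
```

Each row contains exactly one 1, so the rows of `Y` sum to 1.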

Decision Boundaries from Regression Models for Indicator Variables

Note that the columns of $B$ are the $\beta$s of the univariate models. We have

$$\hat{B} = (X^T X)^{-1} X^T Y,$$

and, for a new set of observations with predictor variables (and a column of 1s) in the $j \times (p + 1)$ matrix $X_0$,

$$\hat{Y} = X_0 (X^T X)^{-1} X^T Y.$$

(Recall that our convention is to use $x$ to represent a $p$-vector, and $X$ to represent a matrix with $(1, x)^T$ in the rows.) The predicted class for an observation with predictor variables $x_0$ is the index of the largest component of the fitted vector:

$$\hat{G}(x_0) = \operatorname*{argmax}_{k \in \{1, \ldots, K\}} \left[ (1, x_0)^T \hat{B} \right]_k.$$
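The matrix formulas above can be sketched directly in R. The simulated data, the class centers, and the names `centers`, `X0`, and `G_hat` are assumptions made for this illustration only; `solve(crossprod(X), crossprod(X, Y))` is used in place of an explicit inverse.

```r
set.seed(1)                                  # arbitrary seed for the illustration
N <- 60; K <- 3
g <- rep(1:K, each = N / K)                  # class labels
centers <- rbind(c(0, 0), c(4, 0), c(0, 4))  # hypothetical class means
x <- matrix(rnorm(2 * N), ncol = 2) + centers[g, ]

X <- cbind(1, x)                             # N x (p + 1), with a column of 1s
Y <- outer(g, 1:K, "==") * 1                 # N x K indicator matrix

# B-hat = (X^T X)^{-1} X^T Y, a (p + 1) x K matrix.
B <- solve(crossprod(X), crossprod(X, Y))

# Fitted vectors and predicted classes for three new observations.
X0 <- cbind(1, rbind(c(0, 0), c(4, 0), c(0, 4)))
Y0_hat <- X0 %*% B
G_hat <- apply(Y0_hat, 1, which.max)

rowSums(Y0_hat)                              # each fitted vector sums to 1
G_hat
```

The columns of `B` agree with the coefficients from `lm(Y ~ x)`, which fits the same model one column at a time.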

Decision Boundaries from Regression Models for Targets

The approach we have just described seems very reasonable, and we can develop this same method beginning with different rationales, for example, where we form targets, $t_k$, one for each class. The targets are just vectors with all zeros except for a 1 in the $k$th position. This results in the same coding as we have described, and in choosing the class based on a least squares criterion applied to the targets and the fitted prediction; that is, we choose the class whose target is closest to the fitted prediction. (The fitted prediction is the same as before, that is, $(1, x_0)^T \hat{B}$.)
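The equivalence of the two rules follows from $\|f - t_k\|^2 = \|f\|^2 - 2 f_k + 1$, which is smallest for the largest $f_k$. A minimal sketch with made-up fitted vectors (the matrix `F_hat` is hypothetical) checks this numerically.

```r
# Hypothetical fitted vectors (one per row) for four observations, K = 3.
F_hat <- rbind(c( 0.7, 0.2, 0.1),
               c( 0.3, 0.4, 0.3),
               c(-0.1, 0.5, 0.6),
               c( 0.2, 0.3, 0.5))
K <- ncol(F_hat)
targets <- diag(K)   # row k is the target t_k

# Squared distance from every fitted vector to every target (N x K matrix).
d2 <- sapply(seq_len(K), function(k) rowSums(sweep(F_hat, 2, targets[k, ])^2))

nearest_target <- apply(d2, 1, which.min)     # least squares criterion
largest_fit    <- apply(F_hat, 1, which.max)  # argmax rule

rbind(nearest_target, largest_fit)
```

Note that the third fitted vector has a negative component, which does not affect either rule.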

Decision Boundaries from Regression Models for Indicator Variables

Notice that the individual predictions always sum to 1, although some could be negative. All of this seems pretty reasonable, but how does it work in practice? Not very well if $K > 2$. (The old non-binary classification problem!) See Figure 4.2 in HTF, and the discussion about it. Let's try it.

Decision Boundaries from Regression Models for Indicator Variables

    ns <- c(100, 100, 100)
    n <- sum(ns)
    d <- 5
    set.seed(555)
    x <- matrix(rnorm(2 * n), ncol = 2)
    # move mean of first group
    x[1:ns[1], ] <- x[1:ns[1], ] + c(-d, -d)
    # move mean of third group
    x[(ns[1] + ns[2] + 1):n, ] <- x[(ns[1] + ns[2] + 1):n, ] + c(d, d)
    # set class indicator
    g <- c(rep(1, ns[1]), rep(2, ns[2]), rep(3, ns[3]))
    plot(x, col = g)

Based on the observed values of $x_1$ and $x_2$, we see good linear separation.

[Figure: scatterplot of x[,2] (vertical axis) against x[,1] (horizontal axis), with points colored by class.]

Decision Boundaries from Regression Models for Indicator Variables

Let's fit:

    # first form Y matrix
    y <- matrix(c(rep(1, ns[1]), rep(0, ns[2]), rep(0, ns[3]),
                  rep(0, ns[1]), rep(1, ns[2]), rep(0, ns[3]),
                  rep(0, ns[1]), rep(0, ns[2]), rep(1, ns[3])), ncol = 3)
    lmfit <- lm(y ~ x)
    lmfit

    Coefficients:
                 [,1]  [,2]  [,3]
    (Intercept)
    x1
    x2

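Now let's look at some of those in the first group:

    # row offset into x; named offset so the class indicator g is not overwritten
    offset <- 1
    for (i in 1:5) {
      x0 <- x[offset + i, ]                      # rows 2 to 6, all in group 1
      pred <- drop(c(1, x0) %*% coef(lmfit))     # fitted value for each class
      print(pred)
    }

    [1]
    [1]
    [1]
    [1]
    [1]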

Now let's look at some of those in the third group:

    # row offset into x; named offset so the class indicator g is not overwritten
    offset <- ns[1] + ns[2]
    for (i in 1:5) {
      x0 <- x[offset + i, ]                      # first five rows of group 3
      pred <- drop(c(1, x0) %*% coef(lmfit))     # fitted value for each class
      print(pred)
    }

    [1]
    [1]
    [1]
    [1]
    [1]

Now let's look at some in the middle group:

    # row offset into x; named offset so the class indicator g is not overwritten
    offset <- ns[1]
    for (i in 1:5) {
      x0 <- x[offset + i, ]                      # first five rows of group 2
      pred <- drop(c(1, x0) %*% coef(lmfit))     # fitted value for each class
      print(pred)
    }

    [1]
    [1]    ** wrong
    [1]    ** wrong
    [1]    ** wrong
    [1]    ** wrong
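Rather than inspecting five observations at a time, we can tabulate the predictions for all of the simulated observations. The following sketch regenerates the data with the same seed and settings as above so that it runs on its own; the names `cls`, `pred`, and `acc` are introduced here for illustration.

```r
# Regenerate the simulated data (same seed and settings as in the slides).
ns <- c(100, 100, 100); n <- sum(ns); d <- 5
set.seed(555)
x <- matrix(rnorm(2 * n), ncol = 2)
x[1:ns[1], ] <- x[1:ns[1], ] - d                          # first group
x[(ns[1] + ns[2] + 1):n, ] <- x[(ns[1] + ns[2] + 1):n, ] + d  # third group
cls <- rep(1:3, times = ns)                               # true classes

# Indicator-matrix regression and argmax classification.
y <- outer(cls, 1:3, "==") * 1
lmfit <- lm(y ~ x)
pred <- apply(fitted(lmfit), 1, which.max)

# Confusion table and per-class proportion classified correctly.
print(table(true = cls, predicted = pred))
acc <- tapply(pred == cls, cls, mean)
acc
```

The middle class should show a much lower proportion correct than the two outer classes, which is the masking effect discussed in HTF.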

Decision Boundaries from Regression Models for Indicator Variables

This idea worked well when $K$ was 2. What's wrong when $K > 2$, and what to do? The best separation line is as shown in the left panel of Figure 4.2 in HTF. (Notice, by the way, that there are two lines shown in the right panel of Figure 4.2, but we have three regression hyperplanes. The first line is the projection of the intersection of the first two hyperplanes, and the second line is the projection of the intersection of the second and third hyperplanes.) The linearity that is incorporated into the model is the problem when $K > 2$.

Decision Boundaries from Regression Models for Indicator Variables

This problem also depends on $p$. How can we fix this? Increase $p$ artificially by including quadratic terms in the model. See Figure 4.3 in HTF. The dimension of the predictor space is now $2p + 1$. This works if $K = 3$, as in our case, but if $K = 4$, we need cubic terms. In general, we need terms to the power $K - 1$. This is the germ of the idea of forming separating hyperplanes in higher dimensions, which is a basic element of support vector machines.
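To see the effect of the quadratic terms on the simulated three-class data, we can compare the linear fit with a fit that adds the squares and the cross product. This is a sketch under the same simulation settings as before; the names `xq`, `class_accuracy`, `acc_lin`, and `acc_quad` are hypothetical.

```r
# Regenerate the simulated data (same seed and settings as in the slides).
ns <- c(100, 100, 100); n <- sum(ns); d <- 5
set.seed(555)
x <- matrix(rnorm(2 * n), ncol = 2)
x[1:ns[1], ] <- x[1:ns[1], ] - d
x[(ns[1] + ns[2] + 1):n, ] <- x[(ns[1] + ns[2] + 1):n, ] + d
cls <- rep(1:3, times = ns)
y <- outer(cls, 1:3, "==") * 1

# Augmented predictors: x1, x2, x1^2, x2^2, x1*x2 (2p + 1 = 5 columns for p = 2).
xq <- cbind(x, x^2, x[, 1] * x[, 2])

# Proportion classified correctly within each class, using the argmax rule.
class_accuracy <- function(fit) {
  pred <- apply(fitted(fit), 1, which.max)
  tapply(pred == cls, cls, mean)
}

acc_lin  <- class_accuracy(lm(y ~ x))    # linear terms only
acc_quad <- class_accuracy(lm(y ~ xq))   # with quadratic terms
rbind(linear = acc_lin, quadratic = acc_quad)
```

With the quadratic terms, the fitted function for the middle class can peak in the middle, so that class is no longer masked by the other two.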

Discriminant Analysis

Given an observation on a predictor variable $X$, our interest is in the conditional probability distribution of the class variable $G$. If $f_k(x)$ is the probability that $X = x$ in class $G = k$, then we have

$$\Pr(G = k \mid X = x) = \frac{\Pr(G = k \text{ and } X = x)}{\Pr(X = x)} = \frac{f_k(x)\,\pi_k}{\sum_{j=1}^{K} f_j(x)\,\pi_j},$$

where the $\pi_j$ are prior probabilities of being in one class or another.
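A small numerical sketch of this formula, assuming a one-dimensional predictor, normal class-conditional densities with unit variance, and made-up means and priors (the names `prior`, `mu`, and `posterior` are hypothetical):

```r
# Hypothetical setup: K = 3 classes, normal densities with different means.
prior <- c(0.5, 0.3, 0.2)   # pi_k, assumed known
mu    <- c(-2, 0, 2)        # class means

# Posterior probabilities Pr(G = k | X = x) for a single value x.
posterior <- function(x) {
  num <- dnorm(x, mean = mu, sd = 1) * prior   # f_k(x) * pi_k
  num / sum(num)                               # divide by sum_j f_j(x) * pi_j
}

posterior(0.5)
sum(posterior(0.5))   # the posterior probabilities sum to 1
```

The denominator is the same for every class, so only the numerators $f_k(x)\pi_k$ matter when choosing the most probable class.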

Discriminant Analysis

Incorporating prior weights is straightforward, and we can use these prior probabilities, which can arise either from prior beliefs (from a Bayesian perspective) or from some known or assumed distribution of the relative probabilities of being in each class. The most common way of assigning prior probabilities is to use the relative proportion of a class in the training set as the prior probability of that class. We often omit them, that is, assume that they are all equal (so each can be treated as 1 and dropped).

Discriminant Analysis

While the relation

$$\Pr(G = k \mid X = x) = \frac{f_k(x)}{\sum_{j=1}^{K} f_j(x)}$$

makes sense, we need to choose how to use it. There are several possibilities that will be explored in Chapter 6 of HTF (which we probably will not cover), and there is a very simple one (in Chapter 4), which goes back to the early days of Statistics as a science. An important first step is to extend the idea of probability to probability density. (Although this extension appears reasonable, the justification is beyond the scope of this course.)

Discriminant Analysis

Suppose that the predictor variables in each class have a $p$-variate normal distribution with the same variance-covariance matrix, but just with different means; that is, the probability is the PDF

$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \, e^{-\frac{1}{2} (x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}.$$

This leads to linear discriminant analysis, or LDA. For two classes, $k$ and $j$, we want to compare $\Pr(G = k \mid X = x)$ with $\Pr(G = j \mid X = x)$, but that is just comparing $f_k(x)$ with $f_j(x)$.
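The density above can be written out directly. The function name `dmvn` and the numerical values of the means and covariance matrix below are assumptions for illustration; base R has no multivariate normal density function of its own.

```r
# p-variate normal density at a single point x (a vector of length p).
dmvn <- function(x, mu, Sigma) {
  p <- length(mu)
  dev <- x - mu
  q <- drop(t(dev) %*% solve(Sigma, dev))   # (x - mu)^T Sigma^{-1} (x - mu)
  exp(-q / 2) / ((2 * pi)^(p / 2) * sqrt(det(Sigma)))
}

# Two hypothetical classes with a common covariance matrix.
Sigma <- matrix(c(1, 0.5, 0.5, 2), nrow = 2)
mu_k <- c(0, 0)
mu_j <- c(2, 1)
x0 <- c(0.5, 0.2)

# Compare the two class densities at x0.
c(f_k = dmvn(x0, mu_k, Sigma), f_j = dmvn(x0, mu_j, Sigma))
```

With an identity covariance matrix the density reduces to a product of univariate normal densities, which gives a quick check of the function.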

Discriminant Analysis

The ratio $\Pr(G = k \mid X = x) / \Pr(G = j \mid X = x)$ is just $f_k(x) / f_j(x)$, and that simplifies because the constant out front cancels; furthermore, if we take the log of the ratio, the terms that are quadratic in $x$ also cancel, and we have, after some rearrangement,

$$\log\left( \frac{\Pr(G = k \mid X = x)}{\Pr(G = j \mid X = x)} \right) = -\frac{1}{2} (\mu_k + \mu_j)^T \Sigma^{-1} (\mu_k - \mu_j) + x^T \Sigma^{-1} (\mu_k - \mu_j).$$

This means that the decision boundary between classes $k$ and $j$ is just the hyperplane

$$x^T \Sigma^{-1} (\mu_k - \mu_j) = \frac{1}{2} (\mu_k + \mu_j)^T \Sigma^{-1} (\mu_k - \mu_j).$$

The decision function for the $k$th class is

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k.$$
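The decision functions can be computed by hand once the means and the common covariance matrix are replaced by sample estimates. The sketch below uses simulated data with equal class sizes (so the priors can be omitted); the data, the pooled covariance estimate, and the names `mu_hat`, `Sigma_hat`, and `delta_fun` are assumptions for illustration.

```r
set.seed(42)                                  # arbitrary seed
n_k <- 50
centers <- rbind(c(0, 0), c(3, 0), c(0, 3))   # hypothetical class means
cls <- rep(1:3, each = n_k)
x <- matrix(rnorm(2 * 3 * n_k), ncol = 2) + centers[cls, ]

# Sample class means (K x p) and pooled covariance estimate (p x p).
mu_hat <- rowsum(x, cls) / n_k
resid <- x - mu_hat[cls, ]
Sigma_hat <- crossprod(resid) / (nrow(x) - 3)

# delta_k(x) = x^T Sigma^{-1} mu_k - (1/2) mu_k^T Sigma^{-1} mu_k
Sinv_mu <- solve(Sigma_hat, t(mu_hat))        # column k is Sigma^{-1} mu_k
const <- 0.5 * colSums(t(mu_hat) * Sinv_mu)   # (1/2) mu_k^T Sigma^{-1} mu_k
delta_fun <- function(x0) drop(x0 %*% Sinv_mu) - const

# Classify every observation by its largest decision function.
delta <- sweep(x %*% Sinv_mu, 2, const)
pred <- apply(delta, 1, which.max)
mean(pred == cls)
```

A useful check on the boundary equation: at the midpoint of two class means, the two corresponding decision functions are equal.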

Discriminant Analysis

The linear discriminant functions tessellate the $p$-dimensional space of the predictors. For $p = 2$ and 3 classes, for example, we get the picture in the right panel of Figure 4.5 of HTF. Now, how can we use this in practice, since we don't know $\Sigma$, $\mu_k$, or $\mu_j$? Simple: we estimate these from the sample. This is LDA; in R, it's lda (in the MASS package). (See my notes at the link "Linear classification in R; the vowel data" in Week 3 on the class website.) Notice that lda allows specification of prior probabilities.
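A minimal sketch of calling lda on simulated data follows; it is not the vowel-data example from the class notes. The data and the equal priors passed through the `prior` argument are assumptions for illustration, and MASS is a recommended package that is included with most R installations.

```r
library(MASS)   # provides lda()

set.seed(42)                                  # arbitrary seed
n_k <- 50
centers <- rbind(c(0, 0), c(3, 0), c(0, 3))   # hypothetical class means
cls <- rep(1:3, each = n_k)
x <- matrix(rnorm(2 * 3 * n_k), ncol = 2) + centers[cls, ]

# Fit LDA with explicitly specified (here equal) prior probabilities.
fit <- lda(x, grouping = factor(cls), prior = rep(1 / 3, 3))

# Predicted classes for the training observations.
pred <- predict(fit, x)$class
table(true = cls, predicted = pred)
```

If `prior` is omitted, lda uses the class proportions in the training set, which is the common choice mentioned earlier.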

Discriminant Analysis

What next? Well, suppose that the variance-covariance matrices are different. No problem, we can estimate them separately from the individual classes in the training data. The constant terms in the ratio do not cancel, however, and so the discriminant function has an additional term, and the $x$ enters quadratically:

$$\delta_k(x) = -\frac{1}{2} \log |\Sigma_k| - \frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k).$$

This is called quadratic discriminant analysis, QDA; in R, it's qda (in the MASS package).
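The quadratic discriminant function can also be computed by hand. The sketch below simulates two classes with the same center but very different spreads, a case where no linear boundary helps; the data and the name `qda_delta` are assumptions for illustration, and the class sizes are equal so the priors are omitted.

```r
set.seed(7)                                   # arbitrary seed
n_k <- 100
cls <- rep(1:2, each = n_k)
# Class 1 is a tight cloud; class 2 is a wide cloud around the same center.
x <- rbind(matrix(rnorm(2 * n_k, sd = 0.5), ncol = 2),
           matrix(rnorm(2 * n_k, sd = 3),   ncol = 2))

# delta_k(x) = -(1/2) log|Sigma_k| - (1/2) (x - mu_k)^T Sigma_k^{-1} (x - mu_k),
# with mu_k and Sigma_k estimated from the observations in class k.
qda_delta <- function(x, k) {
  xk <- x[cls == k, , drop = FALSE]
  mu_k <- colMeans(xk)
  Sigma_k <- cov(xk)
  logdet <- as.numeric(determinant(Sigma_k, logarithm = TRUE)$modulus)
  -0.5 * logdet - 0.5 * mahalanobis(x, mu_k, Sigma_k)
}

delta <- cbind(qda_delta(x, 1), qda_delta(x, 2))
pred <- apply(delta, 1, which.max)
mean(pred == cls)
```

Here `mahalanobis` returns the squared distance $(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)$ for every row of `x`.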

Linear Classification Using Logistic Regression

Beginning with the same idea as before, that is, of looking at the ratio $\Pr(G = k \mid X = x) / \Pr(G = j \mid X = x)$, we may form the log odds; that is, we use the logit transformation. In the simple development of these ideas, we work with two groups. We let $G$ take on the values 0 or 1 only, and let

$$p(x) = \Pr(G = 1 \mid X = x),$$

and so $p(x) = \mathrm{E}(G \mid X = x)$. Now define

$$\mathrm{logit}(p) = \log\left( \frac{p}{1 - p} \right).$$

Now,

$$\mathrm{logit}(p) \approx X^T \beta, \qquad \text{or} \qquad \frac{p}{1 - p} \approx e^{X^T \beta}.$$
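The logit transformation and its inverse are easy to write out. The function names `logit` and `inv_logit` are introduced here for illustration; base R provides the same transformations as `qlogis` and `plogis`.

```r
# Log odds of a probability p, and the inverse transformation.
logit <- function(p) log(p / (1 - p))
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))

p <- c(0.1, 0.5, 0.9)
logit(p)              # log odds; zero when p = 0.5
inv_logit(logit(p))   # recovers p

# Agreement with the base R functions.
all.equal(logit(p), qlogis(p))
all.equal(inv_logit(logit(p)), plogis(qlogis(p)))
```

The logit maps probabilities in $(0, 1)$ onto the whole real line, which is what makes a linear model for it sensible.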

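Linear Classification Using Logistic Regression

If we have $K > 2$ groups, and we focus on $\Pr(G = k \mid X = x) / \Pr(G = j \mid X = x)$, we can form $K - 1$ log odds:

$$\begin{aligned}
\log\left( \frac{\Pr(G = 1 \mid X = x)}{\Pr(G = K \mid X = x)} \right) &= \beta_{10} + \beta_1^T x \\
\log\left( \frac{\Pr(G = 2 \mid X = x)}{\Pr(G = K \mid X = x)} \right) &= \beta_{20} + \beta_2^T x \\
&\;\;\vdots \\
\log\left( \frac{\Pr(G = K - 1 \mid X = x)}{\Pr(G = K \mid X = x)} \right) &= \beta_{(K-1)0} + \beta_{K-1}^T x
\end{aligned}$$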

Linear Classification Using Logistic Regression

The numerators within the log sum to $1 - \Pr(G = K \mid X = x)$, and we can write the individual conditional probabilities as

$$\Pr(G = k \mid X = x) = \frac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{j=1}^{K-1} \exp(\beta_{j0} + \beta_j^T x)}, \qquad \text{for } k = 1, \ldots, K - 1,$$

$$\Pr(G = K \mid X = x) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(\beta_{j0} + \beta_j^T x)}.$$

We fit this model by maximum likelihood. (What is the probability distribution?) Use Newton's method (an iterative optimization algorithm). In R, generalized linear regression models are fit by glm.
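The sketch below first evaluates the class probabilities for made-up coefficients with $K = 3$, and then fits the two-class model with glm on simulated data and checks the fitted probabilities against the formula. The coefficient matrix `beta`, the function `class_probs`, and the simulation settings are hypothetical.

```r
# Class probabilities from K - 1 linear predictors, with class K as reference.
# beta is a (p + 1) x (K - 1) matrix; row 1 holds the intercepts.
class_probs <- function(x, beta) {
  eta <- drop(c(1, x) %*% beta)        # beta_k0 + beta_k^T x, k = 1..K-1
  denom <- 1 + sum(exp(eta))
  c(exp(eta) / denom, 1 / denom)       # classes 1..K-1, then class K
}

beta <- cbind(c(0.5, 1.0, -1.0), c(-0.2, 0.3, 0.8))   # p = 2, K = 3
probs <- class_probs(c(0.4, -1.2), beta)
sum(probs)                             # the K probabilities sum to 1

# Two-class case: simulate 0/1 responses and fit by maximum likelihood.
set.seed(3)                            # arbitrary seed
n <- 200
x1 <- rnorm(n)
g01 <- rbinom(n, size = 1, prob = plogis(-0.5 + 2 * x1))
fit <- glm(g01 ~ x1, family = binomial)
coef(fit)

# Fitted probabilities from glm and from the formula with K = 2.
p_glm <- unname(predict(fit, type = "response"))
eta_hat <- unname(coef(fit)[1] + coef(fit)[2] * x1)
p_formula <- exp(eta_hat) / (1 + exp(eta_hat))
all.equal(p_glm, p_formula)
```

In the two-class fit, the class coded 0 plays the role of the reference class $K$.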

Linear Classification with Separating Hyperplanes

If the classes are separable by a hyperplane, there are many ways of finding a hyperplane that falls between the classes, but it is still a difficult problem in high dimensions. One method, called a perceptron algorithm, begins with a hyperplane and then adjusts it iteratively by minimizing the distance between the hyperplane and the misclassified points. The idea is simple (and it led to more complicated neural network algorithms), but the method is rather unstable. It may not converge. (In computerese, it is not an algorithm.)
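A minimal sketch of the idea, assuming the standard update rule in which each misclassified point $(x_i, y_i)$, with $y_i \in \{-1, 1\}$, moves the coefficients by a multiple of $y_i (1, x_i)$. The function name `perceptron`, the learning rate, the cap on the number of passes, and the simulated separable data are all assumptions for illustration.

```r
set.seed(11)                                   # arbitrary seed
n_k <- 50
# Two well-separated hypothetical classes, coded -1 and +1.
x <- rbind(matrix(rnorm(2 * n_k, mean = -4), ncol = 2),
           matrix(rnorm(2 * n_k, mean =  4), ncol = 2))
y <- rep(c(-1, 1), each = n_k)
X <- cbind(1, x)                               # include the intercept column

perceptron <- function(X, y, rate = 1, max_epochs = 1000) {
  beta <- rep(0, ncol(X))                      # starting hyperplane
  for (epoch in seq_len(max_epochs)) {
    n_updates <- 0
    for (i in sample(nrow(X))) {               # visit points in random order
      if (y[i] * sum(X[i, ] * beta) <= 0) {    # point i is misclassified
        beta <- beta + rate * y[i] * X[i, ]    # move the hyperplane toward it
        n_updates <- n_updates + 1
      }
    }
    # A full pass with no updates means every point is correctly classified.
    if (n_updates == 0) {
      return(list(beta = beta, epochs = epoch, converged = TRUE))
    }
  }
  list(beta = beta, epochs = max_epochs, converged = FALSE)
}

fit <- perceptron(X, y)
fit$converged
fit$beta
```

The cap `max_epochs` is there because, when the classes are not separable, the updates never stop; the hyperplane found also depends on the starting values and on the order in which the points are visited.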

Linear Classification

Refer to the link "Linear classification in R; the vowel data" in Week 4 on the class website.
