Linear Decision Boundaries

Linear Decision Boundaries

A basic approach to classification is to find a decision boundary in the space of the predictor variables. The decision boundary is often a curve formed by a regression model,

  y_i = f(x_i) + ε_i,

which we often take as linear:

  y_i = β_0 + β_1 x_{1i} + ... + β_p x_{pi} + ε_i ≡ β_0 + β^T x_i.

We often denote the decision function for the k-th class as δ_k(x). It is in the context of such regression models that we considered the problem of building the model by variable selection, regularization, and derived input directions.

Decision Boundaries from Regression Models

How do we use the regression model to form a decision boundary? If we have K classes, the basic idea is to fit the model y_i ≈ β_0 + β^T x_i separately in each of the classes; that is, we have

  ŷ = f̂_k(x) = β̂_{k0} + β̂_k^T x,  for each k = 1, ..., K.

We use the same predictor variables in all classes. The decision boundary between classes j and k is the set of points where f̂_j(x) = f̂_k(x). This is a hyperplane:

  (β̂_{j0} - β̂_{k0}) + (β̂_j - β̂_k)^T x = 0.
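As a small illustration (a sketch, not from the original notes; the two-class simulated data and the names xs, y1, f1, etc. are made up), the boundary can be read off as the difference of the two fitted coefficient vectors:

set.seed(1)
xs <- matrix(rnorm(200), ncol = 2)        # 100 observations on two predictors
xs[1:50, ] <- xs[1:50, ] + 2              # shift the first class
y1 <- rep(c(1, 0), each = 50)             # indicator of class 1
y2 <- 1 - y1                              # indicator of class 2
f1 <- lm(y1 ~ xs)                         # fitted f1(x) = b10 + b1' x
f2 <- lm(y2 ~ xs)                         # fitted f2(x) = b20 + b2' x
boundary <- coef(f1) - coef(f2)           # f1(x) = f2(x)  <=>  sum(boundary * c(1, x)) = 0
boundary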

Decision Boundaries from Regression Models

Instead of the linear regression model, we could use a generalized linear model. The simplest one of these is for the situation where K = 2. This leads to logistic regression, where we begin with the probability of being in one class, then form the odds, and then model the log odds. (This is called the logit transformation of the probability.)

(Note that in the general classification problem, we often develop methods for K = 2 and use them for K > 2 by sequentially forming two groups consisting of one class against all of the remaining ones.)

Decision Boundaries from Regression Models for Indicator Variables

We form an indicator matrix, Y, in which the columns are associated with the classes; for a given observation, we represent its class by a 1 in the appropriate column and 0s in all of the other columns.

The linear regression model for each column of Y is y = Xβ (where X contains a column of 1s), but we can put them all together as

  Y = XB,

where Y is N × K, X is N × (p + 1), and B is (p + 1) × K. Fitting this multivariate multiple linear regression model by least squares is done exactly the same way as for a univariate regression model.
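For example (a small sketch, not part of the original notes, which build Y by hand in the example below), with class labels g taking values in 1, ..., K, the indicator matrix can be built in one step:

K <- 3
g <- rep(1:3, times = c(100, 100, 100))   # class labels, as in the simulation below
Y <- diag(K)[g, ]                         # N x K matrix: row i has a 1 in column g[i], 0s elsewhere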

Decision Boundaries from Regression Models for Indicator Variables

Note that the columns of B̂ are the β̂'s of the univariate models. We have

  B̂ = (X^T X)^{-1} X^T Y,

and, for a new set of observations with predictor variables (and a column of 1s) in the j × (p + 1) matrix X_0,

  Ŷ = X_0 (X^T X)^{-1} X^T Y.

(Recall our convention is to use x to represent a p-vector, and X to represent a matrix with (1, x)^T in the rows.)

The predicted class for an observation with predictor variables x_0 is

  Ĝ(x_0) = argmax_{k ∈ {1,...,K}} [ (1, x_0)^T B̂ ]_k.
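In R these formulas translate directly into matrix operations (a sketch, assuming X and X0 are matrices that already carry a leading column of 1s and Y is the indicator matrix; the names Bhat, Yhat, Ghat are illustrative):

Bhat <- solve(t(X) %*% X, t(X) %*% Y)   # (p+1) x K matrix of least-squares coefficients
Yhat <- X0 %*% Bhat                     # fitted values for the new observations
Ghat <- apply(Yhat, 1, which.max)       # predicted class = column with the largest fitted value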

Decision Boundaries from Regression Models for Targets

The approach we have just described seems very reasonable, and we can arrive at the same method beginning with different rationales, for example, by forming targets t_k, one for each class. The targets are just vectors with all zeros except for a 1 in the k-th position.

This results in the same coding as we have described, with the class chosen by applying a least squares criterion to the targets and the fitted prediction. (The fitted prediction is the same as before, that is, (1, x_0)^T B̂.)

Decision Boundaries from Regression Models for Indicator Variables

Notice that the individual predictions always sum to 1, although some could be negative. All of this seems pretty reasonable, but how does it work in practice? Not very well if K > 2. (The old non-binary classification problem!) See Figure 4.2 in HTF, and the discussion about it.

Let's try it.

Decision Boundaries from Regression Models for Indicator Variables

ns <- c(100, 100, 100)
n <- sum(ns)
d <- 5
set.seed(555)
x <- matrix(rnorm(2*n), ncol = 2)
# move mean of first group
x[1:ns[1], ] <- x[1:ns[1], ] + c(-d, -d)
# move mean of third group
x[(ns[1]+ns[2]+1):n, ] <- x[(ns[1]+ns[2]+1):n, ] + c(d, d)
# set class indicator
g <- c(rep(1, ns[1]), rep(2, ns[2]), rep(3, ns[3]))
plot(x, col = g)

Based on the observed values of x_1 and x_2, we see good linear separation.

[Scatterplot of the simulated data, x[,2] versus x[,1], with the three classes in different colors.]

Decision Boundaries from Regression Models for Indicator Variables

Let's fit:

# first form Y matrix
y <- matrix(c(rep(1,ns[1]), rep(0,ns[2]), rep(0,ns[3]),
              rep(0,ns[1]), rep(1,ns[2]), rep(0,ns[3]),
              rep(0,ns[1]), rep(0,ns[2]), rep(1,ns[3])), ncol = 3)
lmfit <- lm(y ~ x)
lmfit

Coefficients:
                  [,1]       [,2]       [,3]
(Intercept)   0.338052   0.332820   0.329128
x1           -0.045579  -0.007448   0.053027
x2           -0.051645   0.007945   0.043700

Now let's look at some of those in the first group:

g <- 1
for (i in 1:5){
  x0 <- x[g+i, ]
  pred <- c(lmfit$coef[,1] %*% c(1, x0),
            lmfit$coef[,2] %*% c(1, x0),
            lmfit$coef[,3] %*% c(1, x0))
  print(pred)
}

[1]  0.7652977  0.3321071 -0.0974049
[1]  0.8186280  0.3257718 -0.1443999
[1]  0.7662933  0.3119272 -0.0782206
[1]  0.8390860  0.3537753 -0.1928613
[1]  0.8466014  0.3140754 -0.1606768

Now let's look at some of those in the third group:

g <- ns[1] + ns[2]
for (i in 1:5){
  x0 <- x[g+i, ]
  pred <- c(lmfit$coef[,1] %*% c(1, x0),
            lmfit$coef[,2] %*% c(1, x0),
            lmfit$coef[,3] %*% c(1, x0))
  print(pred)
}

[1] -0.1284378  0.3104595  0.8179784
[1] -0.0907748  0.3368873  0.7538874
[1] -0.1163448  0.3309336  0.7854112
[1] -0.1318059  0.3436167  0.7881892
[1] -0.0742736  0.3421001  0.7321734

Now let's look at some in the middle group:

g <- ns[1]
for (i in 1:5){
  x0 <- x[g+i, ]
  pred <- c(lmfit$coef[,1] %*% c(1, x0),
            lmfit$coef[,2] %*% c(1, x0),
            lmfit$coef[,3] %*% c(1, x0))
  print(pred)
}

[1] 0.3289085 0.3458270 0.3252646
[1] 0.4318694 0.3402608 0.2278698   ** wrong
[1] 0.2554730 0.3450800 0.3994471   ** wrong
[1] 0.3162268 0.3269453 0.3568279   ** wrong
[1] 0.1961385 0.3051782 0.4986833   ** wrong
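To see the extent of the problem, here is a small follow-up sketch (not in the original notes) that classifies every observation by the argmax rule and tabulates the results; since g was reused as a row offset above, the class labels are rebuilt first:

gclass <- rep(1:3, times = ns)            # rebuild the class labels
yhat <- cbind(1, x) %*% lmfit$coef        # n x 3 matrix of fitted values
ghat <- apply(yhat, 1, which.max)         # argmax rule
table(truth = gclass, predicted = ghat)   # class 2 is largely masked by classes 1 and 3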

Decision Boundaries from Regression Models for Indicator Variables

This idea worked well when K was 2. What goes wrong when K > 2, and what can we do about it?

The best separation line is as shown in the left panel of Figure 4.2 in HTF. (Notice, by the way, that there are two lines shown in the right panel of Figure 4.2, but we have three regression hyperplanes. The first line is the projection of the intersection of the first two hyperplanes, and the second line is the projection of the intersection of the second and third hyperplanes.)

The linearity that is incorporated into the model is the problem when K > 2.

Decision Boundaries from Regression Models for Indicator Variables

This problem also depends on p. How can we fix it? Increase p artificially by including quadratic terms in the model. See Figure 4.3 in HTF. The dimension of the predictor space is now 2p + 1.

This works if K = 3, as in our case, but if K = 4 we need cubic terms. In general, we need polynomial terms up to the power K - 1. This is the germ of the idea of forming separating hyperplanes in higher dimensions, which is a basic element of support vector machines.
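Continuing the same simulated example, one way to add the quadratic terms in R is through the formula interface (a sketch; the particular terms and the names lmfit2, ghat2 are just one choice):

x1 <- x[, 1]; x2 <- x[, 2]
lmfit2 <- lm(y ~ x1 + x2 + I(x1^2) + I(x2^2) + I(x1*x2))    # add squares and the cross-product
yhat2 <- cbind(1, x1, x2, x1^2, x2^2, x1*x2) %*% coef(lmfit2)
ghat2 <- apply(yhat2, 1, which.max)
table(truth = rep(1:3, times = ns), predicted = ghat2)      # the middle class is no longer masked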

Discriminant Analysis

Given an observation on a predictor variable X, our interest is in the conditional probability distribution of the class variable G. If f_k(x) is the probability density of X within class k (that is, the density of X given G = k), then we have

  Pr(G = k | X = x) = Pr(G = k and X = x) / Pr(X = x)
                    = f_k(x) π_k / Σ_{j=1}^K f_j(x) π_j,

where the π_j are the prior probabilities of being in each class.

Discriminant Analysis

Incorporating prior weights is straightforward. These prior probabilities can arise either from prior beliefs (from a Bayesian perspective) or from some known or assumed distribution of the relative probabilities of being in each class. The most common way of assigning prior probabilities is to use the relative proportion of a class in the training set as the prior probability of that class.

We often omit them; that is, we assume that they are all equal, so that they cancel from the ratio.

Discriminant Analysis

While the relation (with equal priors)

  Pr(G = k | X = x) = f_k(x) / Σ_{j=1}^K f_j(x)

makes sense, we need to choose how to use it. There are several possibilities that will be explored in Chapter 6 of HTF (which we probably will not cover), and there is a very simple one (in Chapter 4), which goes back to the early days of Statistics as a science.

An important first step is to extend the idea of probability to probability density. (Although this extension appears reasonable, the justification is beyond the scope of this course.)

Discriminant Analysis

Suppose that the predictor variables in each class have a p-variate normal distribution with the same variance-covariance matrix but different means; that is, the class-conditional density is

  f_k(x) = 1 / ( (2π)^{p/2} |Σ|^{1/2} ) exp( -(1/2) (x - µ_k)^T Σ^{-1} (x - µ_k) ).

This leads to linear discriminant analysis, or LDA. For two classes, k and j, we want to compare Pr(G = k | X = x) with Pr(G = j | X = x), but that is just comparing f_k(x) with f_j(x).

Discriminant Analysis

The ratio Pr(G = k | X = x) / Pr(G = j | X = x) is just f_k(x)/f_j(x), and that simplifies because the constant out front cancels; furthermore, if we take the log of the ratio, we have, after some rearrangement (with equal priors),

  log( Pr(G = k | X = x) / Pr(G = j | X = x) )
      = -(1/2) (µ_k + µ_j)^T Σ^{-1} (µ_k - µ_j) + x^T Σ^{-1} (µ_k - µ_j).

This means that the decision boundary between classes k and j is just the hyperplane

  x^T Σ^{-1} (µ_k - µ_j) = (1/2) (µ_k + µ_j)^T Σ^{-1} (µ_k - µ_j).

The decision function for the k-th class is

  δ_k(x) = x^T Σ^{-1} µ_k - (1/2) µ_k^T Σ^{-1} µ_k.
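In practice the population quantities are replaced by sample estimates. As a sketch (continuing the simulated example, with equal priors so there is no log π_k term, matching δ_k above; the names gclass, mu, Sigma, etc. are illustrative), the discriminant functions can be computed directly:

gclass <- rep(1:3, times = ns)        # class labels for the simulated data
mu <- lapply(1:3, function(k) colMeans(x[gclass == k, ]))       # class mean vectors
# pooled within-class covariance estimate
Sigma <- Reduce(`+`, lapply(1:3, function(k) (ns[k] - 1) * cov(x[gclass == k, ]))) / (n - 3)
Sinv <- solve(Sigma)
# delta_k(x) = x' Sinv mu_k - (1/2) mu_k' Sinv mu_k, for each class k
delta <- sapply(1:3, function(k)
  x %*% Sinv %*% mu[[k]] - 0.5 * drop(t(mu[[k]]) %*% Sinv %*% mu[[k]]))
ghat_lda <- apply(delta, 1, which.max)
table(truth = gclass, predicted = ghat_lda)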

Discriminant Analysis

The linear discriminant functions tessellate the p-dimensional space of the predictors. For p = 2 and 3 classes, for example, we get the picture in the right panel of Figure 4.5 of HTF.

Now, how can we use this in practice, since we don't know Σ, µ_k, or µ_j? Simple: we estimate these from the sample. This is LDA; in R, it's lda (in the MASS package). (See my notes at the link "Linear classification in R; the vowel data" in Week 3 on the class website. Notice that lda allows specification of prior probabilities.)
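A quick check with the MASS implementation on the same simulated data (a sketch; with equal group sizes, the default priors are 1/3 each):

library(MASS)
ldafit <- lda(x, grouping = rep(1:3, times = ns))
ldapred <- predict(ldafit, x)$class
table(truth = rep(1:3, times = ns), predicted = ldapred)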

Discriminant Analysis

What next? Well, suppose that the variance-covariance matrices are different. No problem: we can estimate them separately from the individual classes in the training data. The constant terms in the ratio do not cancel, however, so the discriminant function has an additional term, and x enters quadratically:

  δ_k(x) = -(1/2) log |Σ_k| - (1/2) (x - µ_k)^T Σ_k^{-1} (x - µ_k).

This is called quadratic discriminant analysis, QDA; in R, it's qda (in the MASS package).
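The corresponding call for QDA (a sketch on the same simulated data):

library(MASS)
qdafit <- qda(x, grouping = rep(1:3, times = ns))   # a separate covariance matrix per class
qdapred <- predict(qdafit, x)$class
table(truth = rep(1:3, times = ns), predicted = qdapred)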

Linear Classification Using Logistic Regression

Beginning with the same idea as before, that is, of looking at the ratio Pr(G = k | X = x) / Pr(G = j | X = x), we may form the log odds; that is, we use the logit transformation.

In the simple development of these ideas, we work with two groups. We let G take on the values 0 or 1 only, and let

  p(x) = Pr(G = 1 | X = x),

and so p(x) = E(G | X = x). Now define

  logit(p) = log( p / (1 - p) ).

The model is

  logit(p(x)) = β_0 + β^T x,

or, equivalently,

  p(x) = e^{β_0 + β^T x} / ( 1 + e^{β_0 + β^T x} ).

Linear Classification Using Logistic Regression

If we have K > 2 groups, and we focus on Pr(G = k | X = x) / Pr(G = j | X = x), we can form K - 1 log odds:

  log( Pr(G = 1 | X = x) / Pr(G = K | X = x) )     = β_{10} + β_1^T x
  log( Pr(G = 2 | X = x) / Pr(G = K | X = x) )     = β_{20} + β_2^T x
    ...
  log( Pr(G = K-1 | X = x) / Pr(G = K | X = x) )   = β_{(K-1)0} + β_{K-1}^T x
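One way to fit all K - 1 logit equations together in R is multinom from the nnet package (a sketch on the simulated data; multinom is not mentioned in the original notes):

library(nnet)
dat <- data.frame(class = factor(rep(1:3, times = ns)), x1 = x[, 1], x2 = x[, 2])
mfit <- multinom(class ~ x1 + x2, data = dat)    # K - 1 = 2 sets of (intercept, slopes)
table(truth = dat$class, predicted = predict(mfit, dat))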

Linear Classification Using Logistic Regression

The numerators within the logs, Pr(G = k | X = x) for k = 1, ..., K - 1, sum to 1 - Pr(G = K | X = x), and we can write the individual conditional probabilities as

  Pr(G = k | X = x) = exp(β_{k0} + β_k^T x) / ( 1 + Σ_{j=1}^{K-1} exp(β_{j0} + β_j^T x) ),  for k = 1, ..., K - 1,

  Pr(G = K | X = x) = 1 / ( 1 + Σ_{j=1}^{K-1} exp(β_{j0} + β_j^T x) ).

We fit this model by maximum likelihood. (What is the probability distribution?) We use Newton's method (an iterative optimization algorithm). In R, generalized linear models are fit by glm.
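For the two-class case, the fit is a single call to glm with the binomial family (a minimal sketch on made-up data; the names z, cls, gfit are illustrative):

set.seed(42)
n2 <- 200
z <- matrix(rnorm(2 * n2), ncol = 2)
cls <- rep(c(0, 1), each = n2 / 2)
z[cls == 1, ] <- z[cls == 1, ] + 1           # two moderately overlapping classes
gfit <- glm(cls ~ z, family = binomial)      # logit Pr(G = 1 | X = x) = beta0 + beta^T x
coef(gfit)                                   # estimated beta0, beta1, beta2
phat <- predict(gfit, type = "response")     # fitted Pr(G = 1 | X = x)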

Linear Classification with Separating Hyperplanes

If the classes are separable by a hyperplane, there are many ways of finding a hyperplane that falls between the classes, but it is still a difficult problem in high dimensions.

One method, called the perceptron algorithm, begins with a hyperplane and then adjusts it iteratively by minimizing the distance between the hyperplane and the misclassified points. The idea is simple (and it led to more complicated neural network algorithms), but the method is rather unstable. It may not converge. (In computerese, it is not an algorithm.)
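A minimal version of the perceptron update, just to make the idea concrete (a sketch assuming two classes coded +1/-1; it cycles through misclassified points and, as noted above, is not guaranteed to terminate if the classes are not separable):

perceptron <- function(X, y, rate = 1, maxit = 10000) {
  # X: n x p matrix of predictors; y: labels coded +1 / -1
  Xa <- cbind(1, X)                     # prepend a column of 1s for the intercept
  w <- rep(0, ncol(Xa))                 # (intercept, coefficients) of the hyperplane
  for (it in 1:maxit) {
    miss <- which(y * (Xa %*% w) <= 0)  # points on the wrong side of the current hyperplane
    if (length(miss) == 0) return(w)    # separating hyperplane found
    i <- miss[1]
    w <- w + rate * y[i] * Xa[i, ]      # move the hyperplane toward the misclassified point
  }
  warning("perceptron did not converge")
  w
}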

Linear Classification

Refer to the link "Linear classification in R; the vowel data" in Week 4 on the class website.