Notes on Discriminant Functions and Optimal Classification
Padhraic Smyth, Department of Computer Science, University of California, Irvine
© 2017

1 Discriminant Functions

Consider a classification problem with a $d$-dimensional input vector $\mathbf{x}$ and a class variable $C$ taking values $\{1, \ldots, K\}$. For simplicity you can consider the components of $\mathbf{x}$ to all be real-valued. We can define a general form for any classifier as follows:

For each class $k \in \{1, \ldots, K\}$ we have a discriminant function $g_k(\mathbf{x})$ that produces a scalar-valued discriminant value for class $k$.

The classifier makes a decision on a new $\mathbf{x}$ by computing the $K$ discriminant values and assigning $\mathbf{x}$ to the class with the largest value, i.e.,

$$k^* = \arg\max_k \; g_k(\mathbf{x})$$

2 Decision Regions and Decision Boundaries

We define the decision region $R_k$ for class $k$, $1 \le k \le K$, as the region of the input space where the discriminant function for class $k$ is larger than any of the other discriminant functions, i.e., if $\mathbf{x} \in R_k$ then $\mathbf{x}$ is classified as being in class $k$ by the classifier, or equivalently, $g_k(\mathbf{x}) > g_j(\mathbf{x}), \; \forall j \ne k$. Note that a decision region need not be a single contiguous region in the input space but could be the union of multiple disjoint subregions (depending on how the discriminant functions are defined).

Decision boundaries between decision regions are defined by equality of the respective discriminant functions. Consider the two-class case for simplicity, with $g_1(\mathbf{x})$ and $g_2(\mathbf{x})$. Points in the input space for which $g_1(\mathbf{x}) = g_2(\mathbf{x})$ are by definition on the decision boundary between the two classes. This is equivalent
to saying that the equation for the decision boundary is defined by $g_1(\mathbf{x}) - g_2(\mathbf{x}) = 0$. In fact for the two-class case it is clear that we don't need two discriminant functions, i.e., we can just define a single discriminant function $g(\mathbf{x})$ and predict class 0 if $g(\mathbf{x}) < 0$ and class 1 if $g(\mathbf{x}) > 0$.

We define linear discriminants as $g_k(\mathbf{x}) = \theta_k^T \mathbf{x}$, i.e., an inner product of a weight vector and the input vector $\mathbf{x}$, where $\theta_k$ is a $d$-dimensional weight vector for class $k$. Note that we have assumed here that $\mathbf{x}$ is defined in a way such that one of its inputs is always set to 1 (or some constant), to ensure that we have an intercept term in our models. In general, linear discriminant functions will lead to piecewise linear decision regions in the input space. In particular, for the two-class case, the decision boundary equation is defined as

$$g_1(\mathbf{x}) - g_2(\mathbf{x}) = \theta_1^T \mathbf{x} - \theta_2^T \mathbf{x} = \theta^T \mathbf{x} = 0,$$

i.e., we only need a single weight vector $\theta = \theta_1 - \theta_2$. The equation $\theta^T \mathbf{x} = 0$ will in general define a $(d-1)$-dimensional hyperplane in the $d$-dimensional input space, partitioning the input space into two contiguous decision regions separated by the hyperplane. Examples of linear discriminants are perceptrons and logistic models.

More generally, if the $g_k(\mathbf{x})$ are polynomial functions of order $r$ in $\mathbf{x}$, the decision boundaries in the general case will also be polynomials of order $r$. An example of a discriminant function that produces quadratic boundaries ($r = 2$) in the general case is the multivariate Gaussian classifier. And if the $g_k(\mathbf{x})$ are non-linear functions of $\mathbf{x}$ then we will get non-linear decision boundaries (an example would be a neural network with one or more hidden layers).
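The argmax decision rule with linear discriminants can be sketched in a few lines of Python. This is only an illustration: the weight values, the names `THETAS`, `classify`, and `two_class_g`, and the convention that the first input component is the constant 1 are all assumptions made for this sketch, not part of the notes.

```python
def discriminant_values(thetas, x):
    """Compute g_k(x) = theta_k^T x for every class k (one weight vector per class)."""
    return [sum(t_j * x_j for t_j, x_j in zip(theta_k, x)) for theta_k in thetas]


def classify(thetas, x):
    """Return the class label in {1, ..., K} with the largest discriminant value.

    Ties (points exactly on a decision boundary) go to the lowest class index.
    """
    g = discriminant_values(thetas, x)
    return 1 + max(range(len(g)), key=lambda k: g[k])


# Hypothetical weights for K = 3 classes; x[0] is the constant 1 (intercept term).
THETAS = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.5, -1.0, -1.0],
]

# Two-class case (classes 1 and 2 only): one weight vector theta = theta_1 - theta_2.
THETA_12 = [a - b for a, b in zip(THETAS[0], THETAS[1])]


def two_class_g(x):
    """g(x) = theta^T x; positive favors class 1, negative favors class 2."""
    return sum(t_j * x_j for t_j, x_j in zip(THETA_12, x))


if __name__ == "__main__":
    for x in ([1.0, 2.0, 0.5], [1.0, 0.2, 1.5], [1.0, -1.0, -1.0]):
        print(x, "->", classify(THETAS, x))
```

The sign of `two_class_g` agrees with the comparison of the first two discriminant values, which is the sense in which a single weight vector suffices for two classes.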
3 Optimal Discriminants

Optimal discriminants are discriminant functions that minimize the error rate of a classifier (here we assume that all misclassification errors incur equal cost; more generally we can minimize expected cost where different errors may have different costs). It is not hard to see that, for minimizing the error rate, the optimal discriminant is defined as

$$g_k(\mathbf{x}) = p(c_k \mid \mathbf{x}), \quad 1 \le k \le K$$

which is equivalent to maximizing the posterior probability of the class variable $C$ given $\mathbf{x}$, i.e., pick the most likely class for each $\mathbf{x}$.

Note that this is optimal in theory, if we know the true $p(c_k \mid \mathbf{x})$ exactly. In practice we will usually need to approximate this conditional distribution by assuming some functional form for it (e.g., the logistic form) and learning the parameters of this functional form from data. The assumption of a specific functional form may lead to bias (or approximation error), and learning parameters from a finite data set will lead to variance (or estimation error).
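As a small illustration of picking the most likely class, the sketch below uses a toy problem in which the true posterior is known. Everything specific here is an assumption for the example: two classes, one-dimensional Gaussian class densities, equal priors, and the names `PRIORS`, `MEANS`, `STDS`, `posterior`, and `optimal_class`. The posterior is obtained from the class densities and priors by Bayes rule.

```python
import math

# Assumed toy problem (not from the notes): two classes with 1-D Gaussian
# class densities p(x | c_k) and known priors p(c_k).
PRIORS = [0.5, 0.5]
MEANS = [0.0, 2.0]
STDS = [1.0, 1.0]


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior(x):
    """Return [p(c_1 | x), p(c_2 | x)] via Bayes rule.

    Only meant for moderate x: far in the tails both densities underflow to 0.
    """
    joint = [gaussian_pdf(x, m, s) * p for m, s, p in zip(MEANS, STDS, PRIORS)]
    total = sum(joint)
    return [j / total for j in joint]


def optimal_class(x):
    """Optimal discriminant rule: choose the class with the largest posterior."""
    post = posterior(x)
    return 1 + max(range(len(post)), key=lambda k: post[k])


if __name__ == "__main__":
    for x in (-1.0, 0.9, 1.1, 3.0):
        print(x, posterior(x), optimal_class(x))
```

With equal priors and equal variances the decision switches from class 1 to class 2 at the midpoint between the two means, where both posteriors equal 0.5.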
From the definition of the optimal discriminant it is easy to see that another version of the optimal discriminant (in the sense that it will lead to the same decision for any $\mathbf{x}$) is defined by $g_k(\mathbf{x}) = p(\mathbf{x} \mid c_k)\, p(c_k)$ (because of Bayes rule); again, this is theoretically optimal if we know it, but in practice we usually do not. Similarly, any monotonically increasing function of these discriminants, such as $\log p(\mathbf{x} \mid c_k) + \log p(c_k)$, is also an optimal discriminant.

4 The Bayes Error Rate

As mentioned above, the optimal discriminant for any classification problem is defined by $g_k(\mathbf{x}) = p(c_k \mid \mathbf{x})$ (or some variant of this). Even though in practice we won't know the precise functional form or the parameters of $p(c_k \mid \mathbf{x})$, it is nonetheless useful to look at the error rate that we would get with the optimal classifier. The error rate of the optimal classifier is known as the Bayes error rate and provides a lower bound on the error rate of any actual classifier. As we will see below, the Bayes error rate depends on how much overlap there is between the density functions for each class, $p(\mathbf{x} \mid c_k)$, in the input space: if there is a lot of overlap we will get a high Bayes error rate, and with little or no overlap we get a low (near zero) Bayes error rate.

Consider the error rate of the optimal classifier at some particular point $\mathbf{x}$ in the input space. The probability of error is

$$e_{\mathbf{x}} = 1 - p(c_{k^*} \mid \mathbf{x}), \quad \text{where } k^* = \arg\max_k \{ p(c_k \mid \mathbf{x}) \},$$

i.e., since our optimal classifier will always select $k^*$ given $\mathbf{x}$, the classifier will be correct a fraction $p(c_{k^*} \mid \mathbf{x})$ of the time and incorrect a fraction $1 - p(c_{k^*} \mid \mathbf{x})$ of the time, at $\mathbf{x}$.

To compute the overall error rate (i.e., the probability that the optimal classifier will make an error on a random $\mathbf{x}$ drawn from $p(\mathbf{x})$) we need to compute the expected value of the error rate with respect to $p(\mathbf{x})$:

$$P_e = E_{p(\mathbf{x})}[e_{\mathbf{x}}] = \int e_{\mathbf{x}}\, p(\mathbf{x})\, d\mathbf{x} = \int \Big( 1 - \max_k \{ p(c_k \mid \mathbf{x}) \} \Big)\, p(\mathbf{x})\, d\mathbf{x}$$

We can rewrite this in terms of decision regions as

$$P_e = \sum_{k=1}^{K} \int_{R_k} \big( 1 - p(c_k \mid \mathbf{x}) \big)\, p(\mathbf{x})\, d\mathbf{x}$$
i.e., as the sum of $K$ different error terms defined over the $K$ decision regions. These error terms, one per class $k$, reflect how much overlap class $k$ has with each of the other class densities. In the two-class case we can further simplify this to

$$P_e = \int_{R_1} \big( 1 - p(c_1 \mid \mathbf{x}) \big)\, p(\mathbf{x})\, d\mathbf{x} + \int_{R_2} \big( 1 - p(c_2 \mid \mathbf{x}) \big)\, p(\mathbf{x})\, d\mathbf{x} = \int_{R_1} p(c_2 \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} + \int_{R_2} p(c_1 \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$$

Although we can only ever compute this for toy problems where we assume full knowledge of $p(c_k \mid \mathbf{x})$ and $p(\mathbf{x})$, the concept of the Bayes error rate is nonetheless useful and it can be informative to see how it depends on density overlap.

No achievable classifier can do better than the Bayes error rate: the only way to improve on the Bayes error rate is to change the input feature vector $\mathbf{x}$, e.g., to add one or more features to the input space that can potentially separate the class densities more. So in principle it would seem as if it should always be a good idea to include as many features as we can in a classification problem, since the Bayes error rate of a higher-dimensional space is always at least as low as (if not lower than) that of any subset of dimensions of that space. This is true in theory, in terms of the optimal error rate in the higher-dimensional space. In practice, however, when learning from a finite amount of data $N$, adding more features (more dimensions) means we need to learn more parameters, so our actual classifier in the higher-dimensional space could in fact be less accurate (due to estimation noise) than a lower-dimensional classifier, even though the higher-dimensional space might have a lower Bayes error rate.

In general, the actual error rate of a classifier can be thought of as having components similar to those for regression.
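Before turning to that decomposition, the dependence of the Bayes error rate on density overlap can be checked numerically on a toy problem. The sketch below assumes two classes with one-dimensional Gaussian class densities of equal variance; the function names, the integration range, and the grid size are choices made for this example. In the two-class case the integrand $(1 - \max_k p(c_k \mid x))\, p(x)$ equals the smaller of the two joint densities $p(x \mid c_k)\, p(c_k)$, which is what the code integrates with a midpoint rule.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def std_normal_cdf(z):
    """Standard normal CDF, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def bayes_error_two_gaussians(mu1, mu2, std, prior1=0.5,
                              lo=-20.0, hi=20.0, n=40000):
    """Midpoint-rule estimate of P_e = integral of min_k p(x | c_k) p(c_k) dx.

    Assumes both class densities have negligible mass outside [lo, hi].
    """
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        joint1 = prior1 * gaussian_pdf(x, mu1, std)
        joint2 = (1.0 - prior1) * gaussian_pdf(x, mu2, std)
        total += min(joint1, joint2) * h
    return total


if __name__ == "__main__":
    for gap in (0.5, 2.0, 4.0):
        numeric = bayes_error_two_gaussians(0.0, gap, 1.0)
        # Equal priors and equal variances: P_e = Phi(-|mu2 - mu1| / (2 * std)).
        closed_form = std_normal_cdf(-gap / 2.0)
        print(gap, numeric, closed_form)
```

For equal priors and a shared standard deviation the decision boundary is the midpoint between the means, so the numerical estimate can be compared with the closed form $\Phi(-|\mu_2 - \mu_1| / (2\sigma))$; with means 0 and 2 and $\sigma = 1$ this is about 0.1587. Moving the means further apart reduces the overlap and the error, and identical densities give an error of 0.5.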
In regression, for a real-valued $y$ and the squared error loss function, the expected error of a prediction model can be decomposed into a sum of the inherent variability of $y$ given $\mathbf{x}$, plus a (squared) bias term, plus a variance term. For classification problems the decomposition does not have this simple additive form, but it is nonetheless useful to think of the actual error rate of a classifier as having components that come from (1) the Bayes error rate, (2) the bias (approximation error) of the classifier, and (3) variance (estimation error due to fitting the model to a finite amount of data).

5 Discriminative versus Generative Models

The definitions above of optimal discriminants suggest two different strategies for learning classifiers from data. In the first we try to learn $p(c_k \mid \mathbf{x})$ directly: this is referred to as the class-conditional or discriminative
approach and examples include logistic regression and neural networks. The second approach, where we learn $p(\mathbf{x} \mid c_k)$ and $p(c_k)$, is often referred to as the generative approach since we are in effect modeling the joint distribution $p(\mathbf{x}, c_k)$, rather than just the conditional $p(c_k \mid \mathbf{x})$, allowing us in principle to generate or simulate data (examples include Gaussian and/or naive Bayes classifiers).

The class-conditional/discriminative approach tends to be much more widely used in practice than the generative approach. The primary reason for this is that modeling $p(\mathbf{x} \mid c)$ for high-dimensional $\mathbf{x}$ is usually very difficult to do accurately, whereas modeling the class-conditional distributions $p(c_k \mid \mathbf{x})$ can be much easier. The generative model has some advantages, however, compared to the class-conditional/discriminative model, particularly in problems where it is useful to know something about the distribution of the data in the input space, such as dealing with missing data, semi-supervised learning, or detecting outliers and distributional shifts in the input space.

Example with Missing Data: Consider a prediction problem where we are predicting a binary variable $y$ given a vector $\mathbf{x}$ with a prediction model $p(y = 1 \mid \mathbf{x}) = f(\mathbf{x}; \theta)$ that we have learned from data, where $\mathbf{x}$ is a $d$-dimensional real-valued vector. Assume at prediction time we are being presented with input vectors $\mathbf{x}$ where some of the entries in $\mathbf{x}$ are missing. Assume that in general different components of $\mathbf{x}$ may be missing in a random fashion (e.g., imagine that for each vector $\mathbf{x}$ there is a probability $q$ of each component being missing in an independent fashion; see footnote 1). For example, for a particular $\mathbf{x}$ the first $m$ entries in the vector could be missing:

$$\mathbf{x} = (x_1, \ldots, x_m, x_{m+1}, \ldots, x_d) = (\mathbf{x}_m, \mathbf{x}_o)$$

where $\mathbf{x}_m = (x_1, \ldots, x_m)$ (missing components) and $\mathbf{x}_o = (x_{m+1}, \ldots, x_d)$ (observed components). How should we make a prediction for $y$ given this partially observed vector and a prediction model $f(\mathbf{x}; \theta)$?
We can proceed as follows by conditioning on what is assumed to be known, i.e., conditioning on $\mathbf{x}_o$:

$$\begin{aligned}
p(y = 1 \mid x_{m+1}, \ldots, x_d) = p(y = 1 \mid \mathbf{x}_o)
&= \int_{\mathbf{x}_m} p(y = 1, \mathbf{x}_m \mid \mathbf{x}_o)\, d\mathbf{x}_m \\
&= \int_{\mathbf{x}_m} p(y = 1 \mid \mathbf{x}_m, \mathbf{x}_o)\, p(\mathbf{x}_m \mid \mathbf{x}_o)\, d\mathbf{x}_m \\
&= \int_{\mathbf{x}_m} p(y = 1 \mid \mathbf{x})\, p(\mathbf{x}_m \mid \mathbf{x}_o)\, d\mathbf{x}_m \\
&= \int_{\mathbf{x}_m} f(\mathbf{x}; \theta)\, p(\mathbf{x}_m \mid \mathbf{x}_o)\, d\mathbf{x}_m
\end{aligned}$$

Footnote 1: We will implicitly assume that the data is missing at random, which means that the patterns of missingness in the data do not depend on the actual values of the missing data. An example of data that is not missing at random would be if large values for a variable were more likely to be missing than small values.

We see that the solution is to integrate over the different possible missing values $\mathbf{x}_m$, averaging the predictions $f(\mathbf{x}; \theta)$ made with full/complete input vectors, weighted by $p(\mathbf{x}_m \mid \mathbf{x}_o)$ in each case. These weights are conditional probabilities of the missing data given the observed data, and in general the only way to compute them is to have a full model $p(\mathbf{x})$ of the input vectors. In a generative model we will have such a distribution available, but in a discriminative/conditional model we will only have $p(y \mid \mathbf{x})$. Thus, the generative model has some advantages in situations like this, but modeling $p(\mathbf{x})$ is generally very difficult if $\mathbf{x}$ is high-dimensional.

We might ask what happens if we have missing data in the $\mathbf{x}$ vectors at training time. How should we train such a model? A general technique to handle this, for generative models, is to use the Expectation-Maximization (E-M) algorithm, which is a general method for using maximum likelihood (or MAP estimation) in the presence of missing data. For discriminative models, the practical approach (if there are missing elements of the $\mathbf{x}$ vectors in the training data) is typically to assume that the patterns of missingness in the training and test data are similar in nature, and to assign default values to the missing values in the same way in both the training and test data (e.g., set these values to 0 or to the mean value of that feature in the training data). This is a heuristic and not optimal (in theory) but may work well if there is not too much missing data, and may be the only practical approach when $\mathbf{x}$ is high-dimensional.
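The prediction-time integral over the missing components can be made concrete with a small two-dimensional sketch in which the first feature is missing and the second is observed. All specifics are assumptions for illustration only: a logistic form for $f(\mathbf{x}; \theta)$, a bivariate Gaussian model for $p(\mathbf{x})$ (which makes $p(x_1 \mid x_2)$ Gaussian in closed form), the parameter values, and the function names. The integral is approximated with a midpoint rule over $\pm 8$ conditional standard deviations.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def f(x1, x2, theta):
    """Hypothetical learned model p(y=1 | x) = sigmoid(t0 + t1*x1 + t2*x2)."""
    t0, t1, t2 = theta
    return sigmoid(t0 + t1 * x1 + t2 * x2)


def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def conditional_x1_given_x2(x2, mu, sigma, rho):
    """Mean and std of p(x1 | x2) under an assumed bivariate Gaussian p(x)."""
    mean = mu[0] + rho * (sigma[0] / sigma[1]) * (x2 - mu[1])
    std = sigma[0] * math.sqrt(1.0 - rho ** 2)
    return mean, std


def predict_missing_x1(x2, theta, mu, sigma, rho, n=4001, width=8.0):
    """Approximate p(y=1 | x2) = integral of f(x1, x2) p(x1 | x2) dx1."""
    mean, std = conditional_x1_given_x2(x2, mu, sigma, rho)
    lo = mean - width * std
    h = 2.0 * width * std / n
    total = 0.0
    for i in range(n):
        x1 = lo + (i + 0.5) * h  # midpoint of the i-th cell
        total += f(x1, x2, theta) * gaussian_pdf(x1, mean, std) * h
    return total


if __name__ == "__main__":
    mu, sigma, rho = (0.0, 0.0), (1.0, 1.0), 0.5
    theta = (0.0, 3.0, 0.0)
    x2 = 2.0
    cond_mean, _ = conditional_x1_given_x2(x2, mu, sigma, rho)
    print("marginalized:", predict_missing_x1(x2, theta, mu, sigma, rho))
    print("plug-in conditional mean:", f(cond_mean, x2, theta))
```

In this setup the marginalized prediction is less extreme than the one obtained by plugging in a single value for the missing feature, because the average over $p(x_1 \mid x_2)$ accounts for the uncertainty about $x_1$. When the model ignores the missing feature (its weight is zero), the two coincide.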