Notes on Discriminant Functions and Optimal Classification

Padhraic Smyth, Department of Computer Science, University of California, Irvine. (c) 2017

1 Discriminant Functions

Consider a classification problem with a d-dimensional input vector x and a class variable C taking values in {1, ..., K}. For simplicity you can consider the components of x to all be real-valued. We can define a general form for any classifier as follows:

- For each class k in {1, ..., K} we have a discriminant function g_k(x) that produces a scalar-valued discriminant value for class k.
- The classifier makes a decision on a new x by computing the K discriminant values and assigning x to the class with the largest value, i.e., k* = arg max_k g_k(x).

2 Decision Regions and Decision Boundaries

We define the decision region R_k for class k, 1 <= k <= K, as the region of the input space where the discriminant function for class k is larger than any of the other discriminant functions, i.e., if x is in R_k then x is classified as being in class k by the classifier, or equivalently, g_k(x) > g_j(x) for all j != k. Note that a decision region need not be a single contiguous region in x-space but could be the union of multiple disjoint subregions (depending on how the discriminant functions are defined).

Decision boundaries between decision regions are defined by equality of the respective discriminant functions. Consider the two-class case for simplicity, with g_1(x) and g_2(x). Points x in the input space for which g_1(x) = g_2(x) are by definition on the decision boundary between the two classes. This is equivalent to saying that the equation for the decision boundary is defined by g_1(x) - g_2(x) = 0.
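To make the generic arg max decision rule above concrete, here is a minimal Python sketch (not from the notes); the prototype-based discriminants g_k(x) = -||x - mu_k|| and the particular prototypes are purely an illustrative, assumed choice.

```python
import numpy as np

# Assumed toy setup: K = 3 classes, each with a d-dimensional "prototype" mu_k.
# g_k(x) = -||x - mu_k|| is one illustrative choice of discriminant function.
prototypes = np.array([[0.0, 0.0],
                       [3.0, 0.0],
                       [0.0, 3.0]])   # shape (K, d)

def discriminants(x):
    """Return the K discriminant values g_1(x), ..., g_K(x)."""
    return -np.linalg.norm(prototypes - x, axis=1)

def classify(x):
    """Assign x to the class with the largest discriminant value.
    Points where the two largest discriminants are equal lie on a decision boundary."""
    return int(np.argmax(discriminants(x))) + 1   # classes labeled 1..K as in the notes

x_new = np.array([2.5, 0.5])
print(classify(x_new))   # -> 2, since x_new is closest to the second prototype
```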

In fact, for the two-class case it is clear that we don't need two discriminant functions, i.e., we can just define a single discriminant function g(x) and predict class 0 if g(x) < 0 and class 1 if g(x) > 0. Note that we have assumed here that x is defined in a way such that one of its inputs is always set to 1 (or some constant), to ensure that we have an intercept term in our models.

We define linear discriminants as g_k(x) = θ_k^T x, i.e., an inner product of a weight vector and the input vector, where θ_k is a d-dimensional weight vector for class k. In general, linear discriminant functions will lead to piecewise linear decision regions in the input space. In particular, for the two-class case, the decision boundary equation is defined as g_1(x) - g_2(x) = θ_1^T x - θ_2^T x = θ^T x = 0, i.e., we only need a single weight vector θ. The equation θ^T x = 0 will in general define a (d-1)-dimensional hyperplane in the d-dimensional space, partitioning the input space into two contiguous decision regions separated by the linear hyperplane. Examples of linear discriminants are perceptrons and logistic models.

More generally, if the g_k(x) are polynomial functions of order r in x, the decision boundaries in the general case will also be polynomials of order r. An example of a discriminant function that produces quadratic boundaries (r = 2) in the general case is the multivariate Gaussian classifier. And if the g_k(x) are non-linear functions of x then we will get non-linear decision boundaries (an example would be a neural network with one or more hidden layers).

3 Optimal Discriminants

Optimal discriminants are discriminant functions that minimize the error rate of a classifier (here we assume that all misclassification errors incur equal cost; more generally we can minimize expected cost, where different errors may have different costs). It is not hard to see that, for minimizing the error rate, the optimal discriminant is defined as

    g_k(x) = p(c_k | x),    1 <= k <= K,

which is equivalent to maximizing the posterior probability of the class variable C given x, i.e., picking the most likely class for each x. Note that this is optimal in theory, if we know the true p(c_k | x) exactly. In practice we will usually need to approximate this conditional distribution by assuming some functional form for it (e.g., the logistic form) and learning the parameters of this functional form from data. The assumption of a specific functional form may lead to bias (or approximation error), and learning parameters from a data set will lead to variance (or estimation error).
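As a concrete illustration of approximating p(c_1 | x) with a logistic functional form (the parameter values below are hypothetical, not from the notes): the model has a linear discriminant θ^T x with the intercept handled by a constant-1 input, so its decision boundary is exactly the hyperplane θ^T x = 0 discussed above.

```python
import numpy as np

# Hypothetical fixed parameters for a two-class logistic model p(c_1 | x) = sigma(theta^T x).
# The first component of theta multiplies a constant 1 appended to x (the intercept term).
theta = np.array([-1.0, 2.0, -0.5])   # [intercept, weight for x1, weight for x2]

def posterior_class1(x):
    """Logistic approximation to p(c_1 | x) for a 2-dimensional input x."""
    x_aug = np.concatenate(([1.0], x))        # prepend the constant-1 input
    return 1.0 / (1.0 + np.exp(-theta @ x_aug))

def classify(x):
    """Predict class 1 if g(x) = theta^T x_aug > 0, class 0 otherwise.
    This is the same decision as thresholding the posterior at 0.5, and the
    boundary theta^T x_aug = 0 is a (d-1)-dimensional hyperplane."""
    return int(posterior_class1(x) > 0.5)

x_new = np.array([1.0, 1.0])
print(posterior_class1(x_new), classify(x_new))   # posterior ~0.62, assigned to class 1
```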

From the definition of the optimal discriminant it is easy to see that another version of the optimal discriminant (in the sense that it will lead to the same decision for any x) is defined by

    g_k(x) = p(x | c_k) p(c_k)

(because of Bayes' rule); again, this is theoretically optimal if we know it, but in practice we usually do not. Similarly, any monotonic function of these discriminants, such as log p(x | c_k) + log p(c_k), is also an optimal discriminant.

4 The Bayes Error Rate

As mentioned above, the optimal discriminant for any classification problem is defined by g_k(x) = p(c_k | x) (or some variant of it). Even though in practice we won't know the precise functional form or the parameters of p(c_k | x), it is nonetheless useful to look at the error rate that we would get with the optimal classifier. The error rate of the optimal classifier is known as the Bayes error rate and provides a lower bound on the error rate of any actual classifier. As we will see below, the Bayes error rate depends on how much overlap there is between the density functions for each class, p(x | c_k), in the input space: if there is a lot of overlap we will get a high Bayes error rate, and with little or no overlap we get a low (near zero) Bayes error rate.

Consider the error rate of the optimal classifier at some particular point x in the input space. The probability of error at x is

    e_x = 1 - p(c_{k*} | x),   where k* = arg max_k p(c_k | x),

i.e., since our optimal classifier will always select class k* given x, the classifier will be correct a fraction p(c_{k*} | x) of the time and incorrect a fraction 1 - p(c_{k*} | x) of the time at x. To compute the overall error rate (i.e., the probability that the optimal classifier will make an error on a random x drawn from p(x)) we need to compute the expected value of the error rate with respect to p(x):

    P_e = E_{p(x)}[e_x] = \int e_x \, p(x) \, dx = \int \Big( 1 - \max_k \, p(c_k | x) \Big) \, p(x) \, dx.

We can rewrite this in terms of decision regions as

    P_e = \sum_{k=1}^{K} \int_{R_k} \big( 1 - p(c_k | x) \big) \, p(x) \, dx,

i.e., as the sum of K different error terms defined over the K decision regions. These error terms, one per class k, are proportional to how much overlap class k has with each of the other class densities.
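As an illustration (the toy setup is assumed here, not taken from the notes), the Bayes error rate can be computed numerically for a one-dimensional two-class problem with known priors and known Gaussian class-conditional densities:

```python
import numpy as np
from scipy.stats import norm

# Assumed toy problem: two classes with known priors and 1-D Gaussian class-conditionals.
prior = np.array([0.5, 0.5])
means, sds = np.array([0.0, 2.0]), np.array([1.0, 1.0])

# Dense grid over the input space for numerical integration.
x = np.linspace(-8.0, 10.0, 20001)
dx = x[1] - x[0]

# Joint terms p(x | c_k) p(c_k); summing over k gives the marginal p(x).
joint = np.stack([prior[k] * norm.pdf(x, means[k], sds[k]) for k in range(2)])
p_x = joint.sum(axis=0)

# e_x * p(x) = p(x) - max_k p(c_k | x) p(x) = p(x) - max_k p(x | c_k) p(c_k).
integrand = p_x - joint.max(axis=0)

bayes_error = np.sum(integrand) * dx   # simple Riemann-sum approximation of the integral
print(bayes_error)   # ~0.159 for this amount of overlap (decision boundary at x = 1)
```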

In the two-class case we can further simplify this to

    P_e = \int_{R_1} \big( 1 - p(c_1 | x) \big) \, p(x) \, dx + \int_{R_2} \big( 1 - p(c_2 | x) \big) \, p(x) \, dx
        = \int_{R_1} p(c_2 | x) \, p(x) \, dx + \int_{R_2} p(c_1 | x) \, p(x) \, dx.

Although we can only ever compute this for toy problems where we assume full knowledge of p(c_k | x) and p(x), the concept of the Bayes error rate is nonetheless useful, and it can be informative to see how it depends on density overlap.

Note that no achievable classifier can ever do better than the Bayes error rate in terms of its accuracy: the only way to improve on the Bayes error rate is to change the input feature vector, e.g., to add one or more features to the input space that can potentially separate the class densities further. So in principle it would seem as if it should always be a good idea to include as many features as we can in a classification problem, since the Bayes error rate of a higher-dimensional space is always at least as low as (if not lower than) that of any subset of dimensions of that space. This is true in theory, in terms of the optimal error rate in the higher-dimensional space; but in practice, when learning from a finite amount of data N, adding more features (more dimensions) means we need to learn more parameters, so our actual classifier in the higher-dimensional space could in fact be less accurate (due to estimation noise) than a lower-dimensional classifier, even though the higher-dimensional space might have a lower Bayes error rate.

In general, the actual error rate of a classifier can be thought of as having components similar to those for regression. In regression, for a real-valued y and the squared-error loss function, the error rate of a prediction model can be decomposed into the sum of the inherent variability of y given x, a bias term, and a variance term. For classification problems the decomposition does not have this simple additive form, but it is nonetheless useful to think of the actual error rate of a classifier as having components that come from (1) the Bayes error rate, (2) the bias (approximation error) of the classifier, and (3) the variance (estimation error due to fitting the model to a finite amount of data).

5 Discriminative versus Generative Models

The definitions above of optimal discriminants suggest two different strategies for learning classifiers from data. In the first, we try to learn p(c_k | x) directly: this is referred to as the class-conditional or discriminative approach, and examples include logistic regression and neural networks.
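A minimal sketch of this discriminative strategy, under assumed synthetic data (not from the notes): a logistic regression model is fit to learn p(c_1 | x) directly from labeled examples (scikit-learn is used here purely for illustration).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed synthetic two-class data: two Gaussian clusters in 2-D, labeled 0 and 1.
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
X1 = rng.normal(loc=[2.0, 2.0], scale=1.0, size=(200, 2))
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Discriminative approach: model p(c_1 | x) directly with a logistic functional form.
model = LogisticRegression().fit(X, y)

x_new = np.array([[1.0, 1.0]])
print(model.predict_proba(x_new)[0, 1])   # learned estimate of p(c_1 | x_new), ~0.5 here
print(model.predict(x_new)[0])            # class with the larger estimated posterior
```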

The second approach, where we learn p(x | c_k) and p(c_k), is often referred to as the generative approach, since we are in effect modeling the joint distribution p(x, c_k) rather than just the conditional p(c_k | x), allowing us in principle to generate or simulate data (examples include Gaussian and/or naive Bayes classifiers).

The class-conditional/discriminative approach tends to be much more widely used in practice than the generative approach. The primary reason for this is that modeling p(x | c) for high-dimensional x is usually very difficult to do accurately, whereas modeling the class-conditional distributions p(c_k | x) can be much easier. The generative model has some advantages, however, compared to the class-conditional/discriminative model, particularly in problems where it's useful to know something about the distribution of the data in the input space, such as dealing with missing data, semi-supervised learning, or detecting outliers and distributional shifts in the input space.

Example with Missing Data: Consider a prediction problem where we are predicting a binary variable y given a vector x, with a prediction model p(y = 1 | x) = f(x; θ) that we have learned from data, where x is a d-dimensional real-valued vector. Assume that at prediction time we are presented with input vectors x where some of the entries in x are missing, and that in general different components of x may be missing in a random fashion (e.g., imagine that for each vector x there is a probability q of each component being missing, independently of the others) [1]. For example, for a particular x the first m entries of the vector could be missing:

    x = (x_1, ..., x_m, x_{m+1}, ..., x_d) = (x_m, x_o)

where x_m = (x_1, ..., x_m) are the missing components and x_o = (x_{m+1}, ..., x_d) are the observed components. How should we make a prediction for y given this partially observed vector and a prediction model f(x; θ)? We can proceed by conditioning on what is assumed to be known, i.e., conditioning on x_o:

    p(y = 1 | x_{m+1}, ..., x_d) = p(y = 1 | x_o)
        = \int p(y = 1, x_m | x_o) \, dx_m
        = \int p(y = 1 | x_m, x_o) \, p(x_m | x_o) \, dx_m
        = \int p(y = 1 | x) \, p(x_m | x_o) \, dx_m
        = \int f(x; θ) \, p(x_m | x_o) \, dx_m.

[1] We will implicitly assume that the data is missing at random, which means that the patterns of missingness in the data do not depend on the actual values of the missing data. An example of data that is not missing at random would be if large values of a variable were more likely to be missing than small values.
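The integral above can be approximated by Monte Carlo when a generative model of the inputs is available. The sketch below assumes, purely for illustration (none of these parameter values come from the notes), that p(x) has been modeled as a multivariate Gaussian, so that p(x_m | x_o) is itself Gaussian with the standard conditional mean and covariance, and that f(x; θ) is a logistic model.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed pieces (illustrative only): a logistic prediction model f(x; theta) over
# d = 3 features plus an intercept, and a Gaussian model of the input distribution p(x).
theta = np.array([0.2, 1.0, -0.5, 0.8])           # [intercept, w1, w2, w3]
mu = np.array([0.0, 1.0, -1.0])                   # mean of p(x)
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])               # covariance of p(x)

def f(x):
    """Prediction model p(y = 1 | x) = sigmoid(theta^T [1, x])."""
    return sigmoid(theta[0] + theta[1:] @ x)

def predict_with_missing(x_obs, obs_idx, mis_idx, n_samples=10000):
    """Monte Carlo estimate of p(y = 1 | x_o): average f(x) over samples of the
    missing block drawn from the conditional Gaussian p(x_m | x_o) implied by p(x)."""
    # Conditional mean/covariance of the missing block given the observed block.
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_mo = Sigma[np.ix_(mis_idx, obs_idx)]
    S_mm = Sigma[np.ix_(mis_idx, mis_idx)]
    K = S_mo @ np.linalg.inv(S_oo)
    cond_mean = mu[mis_idx] + K @ (x_obs - mu[obs_idx])
    cond_cov = S_mm - K @ S_mo.T
    # Sample the missing components, complete each input vector, and average f(x).
    x_m_samples = rng.multivariate_normal(cond_mean, cond_cov, size=n_samples)
    preds = np.empty(n_samples)
    x_full = np.empty(len(mu))
    for i in range(n_samples):
        x_full[obs_idx] = x_obs
        x_full[mis_idx] = x_m_samples[i]
        preds[i] = f(x_full)
    return preds.mean()

# Example: the first feature is missing, the last two are observed.
print(predict_with_missing(x_obs=np.array([1.2, -0.3]),
                           obs_idx=[1, 2], mis_idx=[0]))
```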

We see that the solution is to integrate over the different possible missing values x_m and to average the predictions f(x; θ) made with full (completed) input vectors, weighting each by p(x_m | x_o). These weights are conditional probabilities of the missing data given the observed data, and the only way to compute them in the general case is to have a full model p(x) of the input vectors. In a generative model we will have such a distribution available, but in a discriminative/conditional model we will only have p(y | x). Thus, the generative model has some advantages in situations like this, but modeling p(x) is generally very difficult if x is high-dimensional.

We might also ask what happens if we have missing data in the x vectors at training time: how should we train such a model? A general technique to handle this, for generative models, is the Expectation-Maximization (E-M) algorithm, which is a general method for maximum likelihood (or MAP) estimation in the presence of missing data. For discriminative models, the practical approach (if there are missing elements of the x vectors in the training data) is typically to assume that the patterns of missingness in the training and test data are similar in nature, and to assign default values to the missing entries in the same way in both the training and test data (e.g., set them to 0 or to the mean value of that feature in the training data). This is a heuristic and not optimal (in theory), but it may work well if there is not too much missing data, and it may be the only practical approach when x is high-dimensional.
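A sketch of the default-value heuristic just described, under assumed toy data (not from the notes): missing entries (represented as NaN) are filled with per-feature means computed on the training data, and the same fill values are reused at test time.

```python
import numpy as np

# Assumed training matrix with missing entries encoded as NaN.
X_train = np.array([[1.0,    2.0, np.nan],
                    [2.0, np.nan,    3.0],
                    [3.0,    4.0,    5.0]])

# Per-feature means computed from the observed training values only.
feature_means = np.nanmean(X_train, axis=0)

def impute(X, fill_values):
    """Replace NaN entries with the given per-feature fill values."""
    X = X.copy()
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = fill_values[cols]
    return X

X_train_filled = impute(X_train, feature_means)

# At test time the *training* means are reused, so train and test are treated consistently.
X_test = np.array([[np.nan, 1.0, 2.0]])
X_test_filled = impute(X_test, feature_means)
print(X_train_filled)
print(X_test_filled)   # missing entry filled with the first feature's training mean (= 2.0)
```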