LEC 4: Discriminant Analysis for Classification

Dr. Guangliang Chen
February 25, 2016

Outline

- Last time: FDA (dimensionality reduction)
- Today: QDA/LDA (classification)
- Naive Bayes classifiers
- MATLAB/Python commands

Probabilistic models

We introduce a mixture model for the training data:

- We model the distribution of each training class $C_i$ by a pdf $f_i(x)$.
- We assume that for a fraction $\pi_i$ of the time, $x$ is sampled from $C_i$.

The Law of Total Probability implies that the mixture distribution has pdf

$$f(x) = \sum_i f(x \mid x \in C_i)\, P(x \in C_i) = \sum_i f_i(x)\, \pi_i,$$

which generates both the training and the test data (two independent samples from $f(x)$).

We call $\pi_i = P(x \in C_i)$ the prior probabilities, i.e., the probabilities that $x \in C_i$ before we see the sample.
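
As an illustration of this generative story (a sketch of my own, not from the slides), here is how one could sample from a two-class Gaussian mixture in numpy; the priors, means, and covariances below are made-up values used only for the example.

import numpy as np

rng = np.random.default_rng(0)

# hypothetical two-class mixture: priors pi_i, class means mu_i, class covariances Sigma_i
priors = [0.6, 0.4]
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
covs = [np.eye(2), np.array([[1.0, 0.3], [0.3, 2.0]])]

# draw from f(x) = sum_i pi_i f_i(x): first pick the class C_i, then sample x from f_i
labels = rng.choice(2, size=500, p=priors)
X = np.stack([rng.multivariate_normal(means[i], covs[i]) for i in labels])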

How to classify a new sample

A naive way would be to assign a sample to the class with the largest prior probability:

$$i^* = \arg\max_i \pi_i$$

We don't know the true values of $\pi_i$, so we'll estimate them using the observed training classes (in fact, only their sizes): $\hat\pi_i = \frac{n_i}{n}$ for each $i$.

This method makes a constant prediction, with error rate $1 - \max_i \frac{n_i}{n}$. Is there a better way?
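
For instance, the plug-in priors and the error rate of this constant prediction can be computed directly from the class sizes (a small sketch with hypothetical class sizes, not part of the lecture):

import numpy as np

n_i = np.array([50, 30, 20])      # hypothetical class sizes n_1, n_2, n_3
pi_hat = n_i / n_i.sum()          # estimated priors pi_i = n_i / n
i_star = int(np.argmax(pi_hat))   # constant prediction: always the largest class
error_rate = 1 - pi_hat[i_star]   # 1 - max_i n_i/n  (= 0.5 here)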

Maximum A Posteriori (MAP) classification

A (much) better way is to assign the label based on the posterior probabilities (i.e., probabilities after we see the sample):

$$i^* = \arg\max_i P(x \in C_i \mid x)$$

Bayes' Rule tells us that the posterior probabilities are given by

$$P(x \in C_i \mid x) = \frac{f(x \mid x \in C_i)\, P(x \in C_i)}{f(x)} \propto f_i(x)\, \pi_i$$

Therefore, the MAP classification rule can be stated as

$$i^* = \arg\max_i f_i(x)\, \pi_i$$

This is also called the Bayes classifier.
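
In code, the MAP rule is simply an argmax of $f_i(x)\,\pi_i$ over the classes. The sketch below is my own illustration (not from the slides) and assumes the class densities are available as callables; the densities and priors shown are hypothetical.

import numpy as np
from scipy.stats import multivariate_normal

# hypothetical class-conditional densities f_i and priors pi_i (illustrative values only)
f = [multivariate_normal([0.0, 0.0], np.eye(2)).pdf,
     multivariate_normal([3.0, 1.0], np.eye(2)).pdf]
pi = [0.6, 0.4]

def map_classify(x):
    # i* = argmax_i f_i(x) * pi_i
    return int(np.argmax([f_i(x) * pi_i for f_i, pi_i in zip(f, pi)]))

print(map_classify([2.5, 0.8]))   # assigns class 1 (0-based index)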

Estimating the class-conditional probabilities $f_i(x)$

To estimate $f_i(x)$, we need to pick a model, i.e., a distribution from a certain family, to represent each class. Different choices of the distribution lead to different classifiers:

- LDA/QDA: use a multivariate Gaussian distribution for each class $i$:
$$f_i(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} e^{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}$$
- Naive Bayes: assume independent features in $x = (x_1, \ldots, x_d)$:
$$f_i(x) = \prod_{j=1}^d f_{ij}(x_j)$$

MAP classification with multivariate Gaussians

In this case, we estimate the distribution means $\mu_i$ and covariances $\Sigma_i$ using their sample counterparts (based on the training data):

$$\hat\mu_i = \frac{1}{n_i} \sum_{x \in C_i} x, \quad\text{and}\quad \hat\Sigma_i = \frac{1}{n_i - 1} \sum_{x \in C_i} (x - \hat\mu_i)(x - \hat\mu_i)^T$$

This leads to the following classifier:

$$i^* = \arg\max_i \frac{n_i}{n\,(2\pi)^{d/2} |\hat\Sigma_i|^{1/2}} e^{-\frac{1}{2}(x - \hat\mu_i)^T \hat\Sigma_i^{-1} (x - \hat\mu_i)} = \arg\max_i \ \log n_i - \frac{1}{2}\log|\hat\Sigma_i| - \frac{1}{2}(x - \hat\mu_i)^T \hat\Sigma_i^{-1} (x - \hat\mu_i)$$
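
This rule can be sketched directly in numpy (an illustrative implementation of mine, not the lecture's code; the function names are my own):

import numpy as np

def qda_fit(X, y):
    # estimate (n_i, mu_i, Sigma_i) for each class from the training data
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc), Xc.mean(axis=0), np.cov(Xc, rowvar=False))  # np.cov divides by n_i - 1
    return params

def qda_predict(params, x):
    # i* = argmax_i  log n_i - 0.5*log|Sigma_i| - 0.5*(x - mu_i)^T Sigma_i^{-1} (x - mu_i)
    scores = {}
    for c, (n_i, mu, Sigma) in params.items():
        diff = x - mu
        scores[c] = (np.log(n_i)
                     - 0.5 * np.linalg.slogdet(Sigma)[1]
                     - 0.5 * diff @ np.linalg.solve(Sigma, diff))
    return max(scores, key=scores.get)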

Decision boundary

The decision boundary of a classifier consists of the points where two classes are tied for the maximal score. For the MAP classification rule based on the mixture-of-Gaussians model, the decision boundaries are given by

$$\log n_i - \frac{1}{2}\log|\hat\Sigma_i| - \frac{1}{2}(x - \hat\mu_i)^T \hat\Sigma_i^{-1} (x - \hat\mu_i) = \log n_j - \frac{1}{2}\log|\hat\Sigma_j| - \frac{1}{2}(x - \hat\mu_j)^T \hat\Sigma_j^{-1} (x - \hat\mu_j)$$

This shows that the MAP classifier has quadratic boundaries. We call the above classifier Quadratic Discriminant Analysis (QDA).

Equal covariance: A special case

QDA assumes that each class distribution is multivariate Gaussian (but with its own center $\mu_i$ and covariance $\Sigma_i$). We examine the special case when $\Sigma_1 = \cdots = \Sigma_c = \Sigma$, so that the different classes are shifted versions of each other.

In this case, the MAP classification rule becomes

$$i^* = \arg\max_i \ \log n_i - \frac{1}{2}(x - \hat\mu_i)^T \hat\Sigma^{-1} (x - \hat\mu_i)$$

where $\hat\Sigma$ represents the pooled estimate of $\Sigma$ using all classes:

$$\hat\Sigma = \frac{1}{n - c} \sum_{i=1}^{c} \sum_{x \in C_i} (x - \hat\mu_i)(x - \hat\mu_i)^T$$

Decision boundary in the special case

The decision boundary of the equal-covariance classifier is

$$\log n_i - \frac{1}{2}(x - \hat\mu_i)^T \hat\Sigma^{-1} (x - \hat\mu_i) = \log n_j - \frac{1}{2}(x - \hat\mu_j)^T \hat\Sigma^{-1} (x - \hat\mu_j)$$

which simplifies to

$$x^T \hat\Sigma^{-1} (\hat\mu_i - \hat\mu_j) = \log\frac{n_j}{n_i} + \frac{1}{2}\left(\hat\mu_i^T \hat\Sigma^{-1} \hat\mu_i - \hat\mu_j^T \hat\Sigma^{-1} \hat\mu_j\right)$$

This is a hyperplane with normal vector $\hat\Sigma^{-1}(\hat\mu_i - \hat\mu_j)$, showing that the classifier has linear boundaries. We call it Linear Discriminant Analysis (LDA).
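
A corresponding numpy sketch of the LDA rule with the pooled covariance estimate (again an illustration of mine, not code from the slides):

import numpy as np

def lda_fit(X, y):
    classes = np.unique(y)
    n, d = X.shape
    stats, S = {}, np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        stats[c] = (len(Xc), mu)
        S += (Xc - mu).T @ (Xc - mu)      # within-class scatter of class c
    Sigma = S / (n - len(classes))        # pooled covariance estimate
    return stats, Sigma

def lda_predict(stats, Sigma, x):
    # i* = argmax_i  log n_i - 0.5*(x - mu_i)^T Sigma^{-1} (x - mu_i)
    def score(n_i, mu):
        return np.log(n_i) - 0.5 * (x - mu) @ np.linalg.solve(Sigma, x - mu)
    return max(stats, key=lambda c: score(*stats[c]))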

Relationship between LDA and FDA

The LDA boundaries are hyperplanes with normal vectors $\hat\Sigma^{-1}(\hat\mu_i - \hat\mu_j)$.

In 2-class FDA, the projection direction is $v = S_w^{-1}(\hat\mu_1 - \hat\mu_2)$, where

$$S_w = S_1 + S_2 = \sum_{i=1}^{2} \sum_{x \in C_i} (x - \hat\mu_i)(x - \hat\mu_i)^T = (n - 2)\,\hat\Sigma.$$

Therefore, LDA is essentially a union of 2-class FDAs (with cutoffs selected based on Bayes' rule). However, they are derived from totally different perspectives (optimization versus probabilistic).

Naive Bayes

The naive Bayes classifier is also based on the MAP decision rule:

$$i^* = \arg\max_i f_i(x)\, \pi_i$$

A simplifying assumption is made on the individual features of $x$ (namely, that $x_1, \ldots, x_d$ are independent):

$$f_i(x) = \prod_{j=1}^d f_{ij}(x_j)$$

Accordingly, the decision rule becomes

$$i^* = \arg\max_i \ \pi_i \prod_{j=1}^d f_{ij}(x_j) = \arg\max_i \ \log\pi_i + \sum_{j=1}^d \log f_{ij}(x_j)$$

How to estimate $f_{ij}$

The independence assumption reduces the high dimensional density estimation problem ($f_i(x)$) to a union of simple 1D problems ($\{f_{ij}(x)\}_j$).

Again, we need to pick a model for the $f_{ij}$. For continuous features (which is the case in this course), the standard choice is the 1D normal distribution

$$f_{ij}(x) = \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} e^{-(x - \mu_{ij})^2 / (2\sigma_{ij}^2)}$$

where $\mu_{ij}, \sigma_{ij}$ can be estimated similarly from the training data.
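
Putting the pieces together, a minimal numpy sketch of this Gaussian naive Bayes rule (my own illustration; the function names are not from the slides):

import numpy as np

def gnb_fit(X, y):
    # per class: prior pi_i = n_i/n, feature means mu_ij, feature standard deviations sigma_ij
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.std(axis=0, ddof=1))
    return params

def gnb_predict(params, x):
    def score(pi, mu, sigma):
        # log pi_i + sum_j log f_ij(x_j), with each f_ij a 1D normal density
        logf = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
        return np.log(pi) + logf.sum()
    return max(params, key=lambda c: score(*params[c]))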

MAP classification: A summary

General decision rule:

$$i^* = \arg\max_i f_i(x)\, \pi_i$$

Examples of Bayes classifiers:

- QDA: multivariate Gaussians
- LDA: multivariate Gaussians with equal covariance
- Naive Bayes: independent features $x_1, \ldots, x_d$

We will show some experiments with MATLAB (maybe also Python) next class.

The Fisher Iris dataset

Background (see Wikipedia):

- A typical test case for many statistical classification techniques in machine learning
- Originally used by Fisher for developing his linear discriminant model

Data information:

- 150 observations, with 50 samples from each of three species of Iris (setosa, virginica and versicolor)
- 4 features measured from each sample: the length and the width of the sepals and petals, in centimeters

MATLAB implementation of LDA/QDA

% fit a discriminant analysis classifier
mdl = fitcdiscr(traindata, trainlabels, 'DiscrimType', type)
% where type is one of the following:
%   'Linear' (default): LDA
%   'Quadratic': QDA

% classify new data
pred = predict(mdl, testdata)

Python scripts for LDA/QDA

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

lda = LinearDiscriminantAnalysis()
pred = lda.fit(traindata, trainlabels).predict(testdata)
print("Number of mislabeled points: %d" % (testlabels != pred).sum())
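
For a concrete run (an example of mine, not from the slides), the same script can be tried on the Fisher Iris data that ships with scikit-learn:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)
pred = LinearDiscriminantAnalysis().fit(Xtr, ytr).predict(Xte)
print("Number of mislabeled points: %d" % (yte != pred).sum())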

The singularity issue in LDA/QDA

Both LDA and QDA require inverting covariance matrices, which may be singular in the case of high dimensional data. Common techniques to fix this:

- Apply PCA to reduce the dimensionality first, or
- Regularize the covariance matrices, or
- Use the pseudoinverse: 'pseudolinear', 'pseudoquadratic'
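
In scikit-learn, the first two remedies can be sketched as follows (my example; the 50 components and the shrinkage value 0.1 are arbitrary choices for illustration):

from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# remedy 1: reduce the dimensionality with PCA before LDA
pca_lda = make_pipeline(PCA(n_components=50), LinearDiscriminantAnalysis())

# remedy 2: regularize (shrink) the covariance estimate; shrinkage requires the 'lsqr' or 'eigen' solver
reg_lda = LinearDiscriminantAnalysis(solver='lsqr', shrinkage=0.1)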

MATLAB functions for Naive Bayes

% fit a naive Bayes classifier
mdl = fitcnb(traindata, trainlabels, 'Distribution', 'normal')

% classify new data
pred = predict(mdl, testdata)

Python scripts for Naive Bayes

from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
pred = gnb.fit(traindata, trainlabels).predict(testdata)
print("Number of mislabeled points: %d" % (testlabels != pred).sum())

Improving Naive Bayes

- Independence assumption: apply PCA first to get uncorrelated features (closer to being independent)
- Choice of distribution: change 'normal' to kernel smoothing to be more flexible:

mdl = fitcnb(traindata, trainlabels, 'Distribution', 'kernel')

However, this will be at the expense of speed.
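
In scikit-learn, the first idea can be sketched as a PCA + GaussianNB pipeline (my illustration; the number of components is an arbitrary choice). GaussianNB fits normal likelihoods only, so the kernel-smoothing idea corresponds to the MATLAB 'kernel' option above:

from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB

# decorrelate the features with PCA first, then fit featurewise normal distributions
pca_nb = make_pipeline(PCA(n_components=50), GaussianNB())
# usage: pca_nb.fit(traindata, trainlabels); pred = pca_nb.predict(testdata)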

HW3a (due in 2 weeks)

First use PCA to project the MNIST dataset into s dimensions and then do the following.

1. For each value of s below, perform LDA on the data set and compare the errors you get:
   - s = 154 (95% variance)
   - s = 50
   - s = your own choice (preferably better than the above two)

2. Repeat Question 1 with QDA instead of LDA (everything else being the same).

3. For each value of s below, apply the Naive Bayes classifier (by fitting pixelwise normal distributions) to the data set and compare the errors you get:
   - s = 784 (no projection)
   - s = 154 (95% variance)
   - s = 50
   - s = your own choice (preferably better than the above three)

Next time: Two-dimensional LDA

Midterm project 2: MAP classification

Task: Concisely describe the classifiers we have learned in this part and summarize their corresponding results in a poster to be displayed in the classroom. In the meantime, you are encouraged to try the following ideas:

- The kernel smoothing option in Naive Bayes
- The cost option in LDA/QDA:

mdl = fitcdiscr(traindata, trainlabels, 'Cost', COST)

where COST is a square matrix, with COST(I,J) being the cost of classifying a point into class J if its true class is I. Default: COST(I,J)=1 if I~=J, and COST(I,J)=0 if I=J.

- What else?

Who can participate: One to two students from this class, subject to the instructor's approval.

When to finish: In 2 to 3 weeks.

How it will be graded: Based on clarity, completeness, correctness, and originality.