Lecture 9: Classification, LDA

Reading: Chapter 4
STATS 202: Data mining and analysis
October 13, 2017

Review: Main strategy in Chapter 4

Find an estimate $\hat P(Y \mid X)$. Then, given an input $x_0$, we predict the response as in a Bayes classifier:
$$\hat y_0 = \operatorname{argmax}_y \, \hat P(Y = y \mid X = x_0).$$

Linear Discriminant Analysis (LDA)

Instead of estimating $P(Y \mid X)$ directly, we will estimate:

1. $\hat P(X \mid Y)$: Given the response, what is the distribution of the inputs?
2. $\hat P(Y)$: How probable is each category?

Then, we use Bayes' rule to obtain the estimate:
$$\hat P(Y = k \mid X = x) = \frac{\hat P(X = x \mid Y = k)\,\hat P(Y = k)}{\hat P(X = x)} = \frac{\hat P(X = x \mid Y = k)\,\hat P(Y = k)}{\sum_j \hat P(X = x \mid Y = j)\,\hat P(Y = j)}.$$

Linear Discriminant Analysis (LDA)

Instead of estimating $P(Y \mid X)$, we compute estimates:

1. $\hat P(X = x \mid Y = k) = \hat f_k(x)$, where each $\hat f_k(x)$ is a multivariate normal density.
2. $\hat P(Y = k) = \hat\pi_k$, estimated by the fraction of training samples of class $k$.

[Figure: two panels with axes $X_1$ and $X_2$ illustrating the multivariate normal class densities.]
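
To make the two ingredients concrete, here is a minimal NumPy/SciPy sketch of plugging Gaussian class densities and priors into Bayes' rule; all of the numbers (means, covariance, priors) are made up for illustration and are not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up class means, a shared covariance, and priors for K = 2 classes.
mus    = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigma  = np.eye(2)
priors = np.array([0.6, 0.4])

x = np.array([1.0, 1.2])
densities = np.array([multivariate_normal.pdf(x, mean=mu, cov=Sigma) for mu in mus])

# Bayes' rule: posterior is density times prior, normalized over the classes.
posterior = densities * priors / np.sum(densities * priors)
print(posterior, posterior.argmax())
```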

LDA has linear decision boundaries

Suppose that:

- We know $P(Y = k) = \pi_k$ exactly.
- $P(X = x \mid Y = k)$ is multivariate normal with density:
$$f_k(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}$$
  where $\mu_k$ is the mean of the inputs for category $k$, and $\Sigma$ is the covariance matrix (common to all categories).

Then, what is the Bayes classifier?

LDA has linear decision boundaries

By Bayes' rule, the probability of category $k$, given the input $x$, is:
$$P(Y = k \mid X = x) = \frac{\text{likelihood} \times \text{prior}}{\text{marginal}} = \frac{f_k(x)\,\pi_k}{P(X = x)}$$

The denominator does not depend on the response $k$, so we can write it as a constant:
$$P(Y = k \mid X = x) = C\, f_k(x)\,\pi_k$$

Plugging in the formula for $f_k(x)$:
$$P(Y = k \mid X = x) = \frac{C\,\pi_k}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}$$

LDA has linear decision boundaries

$$P(Y = k \mid X = x) = \frac{C\,\pi_k}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}$$

Now, let us absorb everything that does not depend on $k$ into a constant $C'$:
$$P(Y = k \mid X = x) = C' \pi_k\, e^{-\frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)}$$

and take the logarithm of both sides:
$$\log P(Y = k \mid X = x) = \log C' + \log \pi_k - \tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k).$$

This constant is the same for every category $k$, so we want to find the maximum of the remaining terms over $k$.

LDA has linear decision boundaries

Goal: maximize the following over $k$:
$$\log \pi_k - \tfrac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)$$
$$= \log \pi_k - \tfrac{1}{2}\left[ x^T \Sigma^{-1} x + \mu_k^T \Sigma^{-1} \mu_k \right] + x^T \Sigma^{-1} \mu_k$$
$$= C + \log \pi_k - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k$$

We define the objective:
$$\delta_k(x) = \log \pi_k - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k$$

At an input $x$, we predict the response with the highest $\delta_k(x)$.
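
As a minimal sketch, the objective translates directly into NumPy; the function name and arguments below are illustrative, and it assumes $\pi_k$, $\mu_k$, and $\Sigma$ are given.

```python
import numpy as np

def lda_discriminant(x, mu_k, Sigma_inv, pi_k):
    """delta_k(x) = log(pi_k) - 0.5 * mu_k' Sigma^{-1} mu_k + x' Sigma^{-1} mu_k."""
    return np.log(pi_k) - 0.5 * mu_k @ Sigma_inv @ mu_k + x @ Sigma_inv @ mu_k

# Predict the class with the largest discriminant score (made-up parameters).
mus, priors = [np.array([0.0, 0.0]), np.array([2.0, 2.0])], [0.6, 0.4]
Sigma_inv = np.linalg.inv(np.eye(2))
x = np.array([1.0, 1.2])
scores = [lda_discriminant(x, mu, Sigma_inv, pi) for mu, pi in zip(mus, priors)]
print(int(np.argmax(scores)))
```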

LDA has linear decision boundaries

What is the decision boundary? It is the set of points at which two classes are equally probable:
$$\delta_k(x) = \delta_l(x)$$
$$\log \pi_k - \tfrac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + x^T \Sigma^{-1} \mu_k = \log \pi_l - \tfrac{1}{2} \mu_l^T \Sigma^{-1} \mu_l + x^T \Sigma^{-1} \mu_l$$

This is a linear equation in $x$.

[Figure: two panels with axes $X_1$ and $X_2$ illustrating the linear decision boundaries.]
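
Rearranging the equality (not on the slide, but a one-line consequence of it) makes the linearity explicit: collecting the terms that involve $x$ on one side gives
$$x^T \Sigma^{-1}(\mu_k - \mu_l) = \tfrac{1}{2}\left(\mu_k^T \Sigma^{-1}\mu_k - \mu_l^T \Sigma^{-1}\mu_l\right) + \log\frac{\pi_l}{\pi_k},$$
which has the form $a^T x = b$ with $a = \Sigma^{-1}(\mu_k - \mu_l)$ and $b$ a constant, i.e. a hyperplane (a line when $p = 2$).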

Estimating $\pi_k$

$$\hat\pi_k = \frac{\#\{i : y_i = k\}}{n}$$

In words, the fraction of training samples of class $k$.

Estimating the parameters of $f_k(x)$

Estimate the mean vector $\mu_k$ for each class:
$$\hat\mu_k = \frac{1}{\#\{i : y_i = k\}} \sum_{i : y_i = k} x_i$$

Estimate the common covariance matrix $\Sigma$:

- One predictor ($p = 1$):
$$\hat\sigma^2 = \frac{1}{n - K} \sum_{k=1}^{K} \sum_{i : y_i = k} (x_i - \hat\mu_k)^2.$$
- Many predictors ($p > 1$): compute the vectors of deviations $(x_1 - \hat\mu_{y_1}), (x_2 - \hat\mu_{y_2}), \dots, (x_n - \hat\mu_{y_n})$ and use an unbiased estimate of their covariance matrix, $\hat\Sigma$.
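
A minimal NumPy sketch of these estimates; the function name and return convention are illustrative, not from the slides.

```python
import numpy as np

def estimate_lda_parameters(X, y):
    """Estimate class priors, class means, and the pooled covariance matrix.

    X is an (n, p) array of inputs, y a length-n array of class labels.
    """
    n, p = X.shape
    classes = np.unique(y)
    K = len(classes)
    priors = np.array([np.mean(y == k) for k in classes])         # pi_hat_k
    means = np.array([X[y == k].mean(axis=0) for k in classes])   # mu_hat_k
    # Pooled within-class deviations; 1/(n - K) gives an unbiased estimate.
    deviations = np.vstack([X[y == k] - means[j] for j, k in enumerate(classes)])
    Sigma_hat = deviations.T @ deviations / (n - K)
    return classes, priors, means, Sigma_hat
```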

LDA prediction

For an input $x$, predict the class with the largest:
$$\hat\delta_k(x) = \log \hat\pi_k - \tfrac{1}{2} \hat\mu_k^T \hat\Sigma^{-1} \hat\mu_k + x^T \hat\Sigma^{-1} \hat\mu_k$$

The decision boundaries are defined by:
$$\log \hat\pi_k - \tfrac{1}{2} \hat\mu_k^T \hat\Sigma^{-1} \hat\mu_k + x^T \hat\Sigma^{-1} \hat\mu_k = \log \hat\pi_l - \tfrac{1}{2} \hat\mu_l^T \hat\Sigma^{-1} \hat\mu_l + x^T \hat\Sigma^{-1} \hat\mu_l$$

These are the solid lines in the figure below.

[Figure: two panels with axes $X_1$ and $X_2$; the estimated LDA decision boundaries are drawn as solid lines.]
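
In practice the whole fit-and-predict procedure is available off the shelf, for example in scikit-learn's LinearDiscriminantAnalysis. A small sketch on simulated data (the toy data and the query point are made up for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy two-class data with a shared covariance, just to exercise the API.
rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), size=100),
               rng.multivariate_normal([2, 2], np.eye(2), size=100)])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis()        # pooled-covariance LDA, as in the slides
lda.fit(X, y)                             # estimates pi_hat_k, mu_hat_k, Sigma_hat
print(lda.predict([[1.0, 1.0]]))          # class with the largest discriminant score
print(lda.predict_proba([[1.0, 1.0]]))    # estimated posterior probabilities
```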

Quadratic discriminant analysis (QDA)

The assumption that the inputs of every class have the same covariance $\Sigma$ can be quite restrictive.

[Figure: two panels with axes $X_1$ and $X_2$. Boundaries for Bayes (dashed), LDA (dotted), and QDA (solid).]

Quadratic discriminant analysis (QDA)

In quadratic discriminant analysis we estimate a mean $\hat\mu_k$ and a covariance matrix $\hat\Sigma_k$ for each class separately.

Given an input, it is easy to derive an objective function:
$$\delta_k(x) = \log \pi_k - \tfrac{1}{2} \mu_k^T \Sigma_k^{-1} \mu_k + x^T \Sigma_k^{-1} \mu_k - \tfrac{1}{2} x^T \Sigma_k^{-1} x - \tfrac{1}{2} \log |\Sigma_k|$$

This objective is now quadratic in $x$, and so are the decision boundaries.
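
The per-class objective also translates directly into code; a minimal NumPy sketch, with an illustrative function name and arguments:

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, pi_k):
    """delta_k(x) for QDA, with a class-specific covariance Sigma_k."""
    Sigma_inv = np.linalg.inv(Sigma_k)
    _, logdet = np.linalg.slogdet(Sigma_k)   # numerically stable log|Sigma_k|
    return (np.log(pi_k)
            - 0.5 * mu_k @ Sigma_inv @ mu_k
            + x @ Sigma_inv @ mu_k
            - 0.5 * x @ Sigma_inv @ x
            - 0.5 * logdet)
```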

Quadratic discriminant analysis (QDA)

[Figure: two panels with axes $X_1$ and $X_2$ comparing the Bayes boundary, LDA, and QDA.]

Evaluating a classification method

We have talked about the 0-1 loss:
$$\frac{1}{m} \sum_{i=1}^{m} \mathbf{1}(y_i \neq \hat y_i).$$

It is possible to make the wrong prediction for some classes more often than others. The 0-1 loss doesn't tell you anything about this.

A much more informative summary of the error is a confusion matrix.
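
For instance, scikit-learn's confusion_matrix tabulates true classes against predicted classes; the label vectors below are made up for illustration.

```python
from sklearn.metrics import confusion_matrix

y_true = ["no", "no", "yes", "yes", "no", "yes"]
y_pred = ["no", "no", "no",  "yes", "no", "no"]

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred, labels=["no", "yes"]))
# [[3 0]
#  [2 1]]
```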

Example. Predicting default

Used LDA to predict credit card default in a dataset of 10K people. Predicted yes if $P(\text{default} = \text{yes} \mid X) > 0.5$.

The error rate among people who do not default (the false positive rate) is very low. However, the error rate among people who do default (the false negative rate) is 76%.

False negatives may be a bigger source of concern! One possible solution: change the threshold.

Example. Predicting default

Changing the threshold to 0.2 makes it easier to classify to yes: predicted yes if $P(\text{default} = \text{yes} \mid X) > 0.2$.

Note that the rate of false positives became higher! That is the price to pay for fewer false negatives.
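
Thresholding the estimated posteriors is a one-line change; a tiny sketch with made-up posterior probabilities rather than the Default data:

```python
import numpy as np

# Estimated P(default = yes | X) for five hypothetical customers.
p_yes = np.array([0.03, 0.12, 0.25, 0.48, 0.81])

pred_at_05 = p_yes > 0.5   # default threshold: only the last customer is flagged
pred_at_02 = p_yes > 0.2   # lower threshold: more customers flagged as "yes"
print(pred_at_05, pred_at_02, sep="\n")
```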

Example. Predicting default

Let's visualize the dependence of the error on the threshold.

[Figure: error rate as a function of the threshold, showing the false negative rate (error for defaulting customers), the false positive rate (error for non-defaulting customers), and the 0-1 loss or total error rate.]

Example. The ROC curve

[Figure: ROC curve, plotting the true positive rate against the false positive rate.]

The ROC curve displays the performance of the method for any choice of threshold. The area under the curve (AUC) measures the quality of the classifier:

- 0.5 is the AUC for a random classifier.
- The closer the AUC is to 1, the better.
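
Both the curve and the AUC are available in scikit-learn; a small sketch with made-up labels and scores (not the Default data):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])                 # 1 = default
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])    # estimated P(default = yes | X)

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one point per candidate threshold
print(roc_auc_score(y_true, scores))              # area under the ROC curve
```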

Next time

Comparison of logistic regression, LDA, QDA, and KNN classification. Start Chapter 5: Resampling.