Section 7: Discriminant Analysis.


Ensure you have completed all previous worksheets before advancing.

1 Linear Discriminant Analysis

To perform Linear Discriminant Analysis in R we will make use of the lda function in the package MASS. Load this package and have a look at the function's help file (the MASS package also provides the qda function, which we will need later for Quadratic Discriminant Analysis). Remember from lectures that LDA assumes the data within each group follows a multivariate normal distribution, and that each such group distribution shares the same covariance matrix.

Task: Download the salmon data that is available at:

http://www.cs.tcd.ie/~houldinb/index/mva.html

Read this file into R under the name salmon using the read.table command (one possible import command is sketched just after the training example below). A plot of this data is available by entering the following:

    plot(salmon[,-1], col=as.factor(salmon[,1]))

In order to use LDA, we need to first split the data into a part used to train the classifier and another part that will be used to test the classifier. For this example we will try an 80:20 split (we will again ignore the validation step for the purposes of exposition):

    strain=salmon[c(1:40,51:90),]
    stest=salmon[c(41:50,91:100),]

We are then able to train our classifier in the following way:

    lsol=lda(strain[,c(2,3)], grouping=strain[,1])

The above has told R to train the classifier on the data available in the second and third columns of the data set strain, using the knowledge that this data is grouped in the way described by the first column of strain.
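The import step itself is not shown above; the following is a minimal sketch, assuming the downloaded file has been saved in your working directory as salmon.dat with a header row (the file name and header argument are assumptions to adjust to whatever you actually saved):

    library(MASS)                             # provides lda and qda
    salmon = read.table("salmon.dat", header=TRUE)
    str(salmon)   # expect a grouping column plus Freshwater and Marine measurements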

If you enter the following into R you will be returned with summary information concerning the computation:

    lsol$prior
    Alaska Canada
       0.5    0.5

    lsol$means
           Freshwater  Marine
    Alaska    100.550 422.275
    Canada    138.625 368.650

The first of these provides the prior probabilities of group membership that are used in the analysis, which by default are taken to be the class proportions in the training data. The second provides the estimated group means for each of the two groups (no different from what would be found by applying the mean function to the appropriate subsets). Additional information that is not provided, but may be important, is the single covariance matrix that is being used for the various groupings. You can find this manually by calculating the following:

    \[ \hat{\Sigma} = \frac{(N_1 - 1)\hat{\Sigma}_1 + (N_2 - 1)\hat{\Sigma}_2}{N_1 + N_2 - 2} \]

In the above \(\hat{\Sigma}\) is the estimated common covariance matrix, \(\hat{\Sigma}_i\) is the estimated covariance matrix for the specific group i, whilst \(N_i\) is the number of data points in group i. This formula extends naturally to the case of k groups:

    \[ \hat{\Sigma} = \frac{\sum_{i=1}^{k} (N_i - 1)\hat{\Sigma}_i}{\sum_{i=1}^{k} N_i - k} \]

To request this covariance matrix in R for the case of the salmon data enter the following:

    alaskasalmon=salmon[c(1:40),c(2,3)]
    canadasalmon=salmon[c(51:90),c(2,3)]
    singlecov=(39/78)*(cov(alaskasalmon)+cov(canadasalmon))
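As an aside, the k-group formula above is straightforward to code directly. The following is a minimal sketch of a helper function (pooled.cov is a name invented here, not part of MASS) that pools the within-group covariance matrices of a data frame x given a grouping vector g:

    pooled.cov = function(x, g) {
      g = as.factor(g)
      # sum of the (N_i - 1)-weighted group covariance matrices
      num = Reduce("+", lapply(levels(g), function(lev) {
        xi = x[g == lev, , drop=FALSE]
        (nrow(xi) - 1) * cov(xi)
      }))
      num / (nrow(x) - nlevels(g))   # divide by (sum of N_i) - k
    }

Applied to the training data, pooled.cov(strain[,c(2,3)], strain[,1]) should reproduce singlecov.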

Remember from lectures that the classification rule for LDA is based on the log posterior odds, the decision boundary between groups k and l being where:

    \[ \log\left(\frac{P(k \mid x)}{P(l \mid x)}\right) = \log\left(\frac{\pi_k}{\pi_l}\right) + \log\left(\frac{f(x \mid k)}{f(x \mid l)}\right) = 0 \]

The multivariate normal assumption with common covariance then leads to the following:

    \[ \log\left(\frac{f(x \mid k)}{f(x \mid l)}\right) = x^T \Sigma^{-1}(\mu_k - \mu_l) - \tfrac{1}{2}\left(\mu_k^T \Sigma^{-1} \mu_k - \mu_l^T \Sigma^{-1} \mu_l\right) \]

    \[ \log\left(\frac{f(x \mid k)}{f(x \mid l)}\right) = \left(x - \tfrac{1}{2}(\mu_k + \mu_l)\right)^T \Sigma^{-1} (\mu_k - \mu_l) \]

The term \(\tfrac{1}{2}(\mu_k + \mu_l)\) is the average of the group means, and so \(x - \tfrac{1}{2}(\mu_k + \mu_l)\) is the difference between the observation and this value. Assuming the prior probabilities are equal, the sign of \((x - \tfrac{1}{2}(\mu_k + \mu_l))^T \Sigma^{-1}(\mu_k - \mu_l)\) determines the classification (if positive then group k).

Calling lsol in the R console also provides additional information, in particular:

    Coefficients of linear discriminants:
                       LD1
    Freshwater  0.04390178
    Marine     -0.01806237

These are a (scaled) version of \(\Sigma^{-1}(\mu_k - \mu_l)\), and hence can be used for classifying a new observation (though you need to note that it is the second class, here Canada, that is associated with group k).

Task: Determine the classification for an observation with a Freshwater recording of 120 and a Marine recording of 380.

To automatically predict such an observation enter the following:

    predict(lsol, c(120,380))

To automatically predict the test data enter the following:

    predict(lsol, stest[,c(2,3)])

Remember that when \(\pi_k \neq \pi_l\), the term \(\log(\pi_k/\pi_l)\) no longer disappears (though we can still use the predict function). Of course we can always classify by hand by assigning x to the group k that maximises:

    \[ \log(\pi_k) + \left(x - \tfrac{1}{2}\mu_k\right)^T \Sigma^{-1} \mu_k \]
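To make the by-hand rule concrete, the following is a minimal sketch that scores the observation (120, 380) for each group directly from the fitted quantities (it assumes lsol and singlecov from above are still in the workspace; score.a and score.c are names invented here). The winning group should agree with predict(lsol, c(120,380)):

    x = c(Freshwater=120, Marine=380)
    mu.a = lsol$means["Alaska",]
    mu.c = lsol$means["Canada",]
    Sinv = solve(singlecov)   # inverse of the pooled covariance matrix
    # log(pi_k) + (x - mu_k/2)^T Sigma^{-1} mu_k, evaluated for each group
    score.a = log(lsol$prior["Alaska"]) + t(x - mu.a/2) %*% Sinv %*% mu.a
    score.c = log(lsol$prior["Canada"]) + t(x - mu.c/2) %*% Sinv %*% mu.c
    if (score.c > score.a) "Canada" else "Alaska"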

2 Cross-Validation

Rather than splitting the data into a training and testing split (or a training, validation, and testing split in the case that different models are being considered, e.g., if both LDA and QDA are being considered), an alternative technique for measuring the performance of the model is to ask R to perform cross-validation. For the lda function this is achieved by incorporating the argument CV=TRUE:

    lsolcv=lda(salmon[,c(2,3)], grouping=salmon[,1], CV=TRUE)

You should now find that the output from entering lsolcv includes a list of how each data point was classified when it was the point left out. Additionally, a matrix of group membership probabilities is returned (one row per observation, one column per group). In order to visualise the performance of LDA under cross-validation we can request a plot of the following form:

    plot(salmon[,c(2,3)], col=as.factor(salmon[,1]), pch=as.numeric(lsolcv$class))

The above command plots the two numeric variables of the salmon data, with colouring determined by the true classification and symbols determined by the resulting classification of leave-one-out LDA.

3 Quadratic Discriminant Analysis

The function qda within the package MASS performs Quadratic Discriminant Analysis within R. Remember that the difference between QDA and LDA is that the former permits each group distribution to have its own covariance matrix, whilst the latter assumes a common covariance matrix for all group distributions. The usage of qda within R is just the same as the usage of lda, e.g.:

    qsol=qda(strain[,c(2,3)], grouping=strain[,1])
    predict(qsol, stest[,c(2,3)])

Again you will notice that in an 80:20 training:testing split we have achieved 100% correct classification. The output returned from now entering qsol provides details of the prior probabilities of group membership (again determined by default by the proportion of data points in each group) and the mean vectors for each group. To find the covariance matrices for the two groups enter the following:

    alaskasalmon=salmon[c(1:40),c(2,3)]
    cov(alaskasalmon)
    canadasalmon=salmon[c(51:90),c(2,3)]
    cov(canadasalmon)

Task: Assess the performance of QDA for the salmon data set under cross-validation and produce a plot of your results.
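For a numerical summary of a cross-validated fit, the leave-one-out confusion matrix and error rate can be obtained along the following lines; this sketch uses the lsolcv object from Section 2, and the same approach works for a cross-validated qda fit:

    table(true=salmon[,1], predicted=lsolcv$class)   # leave-one-out confusion matrix
    mean(salmon[,1] != lsolcv$class)                 # leave-one-out error rate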

Task: Assess the performance of the best of LDA and QDA under a 50:25:25 training:validation:testing split of the salmon data (remember that this means you should use 50% of the data to train the models, 25% of the data to assess which trained model appears to be the better classifier, and a further 25% of the data to more accurately assess the true classification rate of the better model). One way to construct such a split is sketched at the end of this worksheet.

Exercise: Assess the performance of LDA and QDA for classifying the iris data by species.

Exercise: Assess the performance of LDA and QDA for classifying the olive data by either region of origin or specific area of origin.

Finally, to do something a little more fancy using the ellipse function in the package ellipse (can you understand what it is doing?):

    library(ellipse)
    plot(salmon[,c(2,3)], col=as.factor(salmon[,1]), xlim=c(50,190),
         ylim=c(290,530))
    lines(ellipse(cov(salmon[c(1:50),c(2,3)]),
          centre=colMeans(salmon[c(1:50),c(2,3)]), level=0.5))
    lines(ellipse(cov(salmon[c(51:100),c(2,3)]),
          centre=colMeans(salmon[c(51:100),c(2,3)]), level=0.5), col=2)
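For the 50:25:25 task above, one way to set up the three subsets is via a random partition of the row indices. This is a minimal sketch under the assumption that the number of rows is divisible by four (as it is for the 100-row salmon data); the set.seed call is only for reproducibility, and the model fitting and comparison are left to you:

    set.seed(2)
    n = nrow(salmon)
    idx = sample(1:n)                        # a random permutation of the rows
    train = salmon[idx[1:(n/2)],]            # first 50% for training
    valid = salmon[idx[(n/2+1):(3*n/4)],]    # next 25% for model selection
    test  = salmon[idx[(3*n/4+1):n],]        # final 25% for the final assessment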