Part I. Linear Discriminant Analysis


Week 5. Based in part on slides from the textbook and slides of Susan Holmes. October 29, 2012.

Nearest centroid rule

Suppose we break our data matrix down by the labels, yielding $(X_j)_{1 \le j \le k}$ with sizes $\mathrm{nrow}(X_j) = n_j$. A simple rule for classification is: assign a new observation with features $x$ to
$$f(x) = \mathrm{argmin}_{1 \le j \le k}\, d(x, X_j).$$

If we can assign a central point or centroid $\mu_j$ to each $X_j$, then we can take the distance above to be the distance to the centroid $\mu_j$. This yields the nearest centroid rule
$$f(x) = \mathrm{argmin}_{1 \le j \le k}\, d(x, \mu_j).$$
What do we mean by distance here?
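Before turning to that question, a minimal sketch of the centroid rule in R, using the iris data that the later plots are based on (Euclidean distance, all four features, and the training-error check are illustrative choices, not from the slides):

```r
## Nearest centroid rule on the iris data (Euclidean distance).
X <- as.matrix(iris[, 1:4])
y <- iris$Species

## Centroid of each class: a k x p matrix of column means.
centroids <- t(sapply(split(as.data.frame(X), y), colMeans))

## Classify a feature vector x to the class with the nearest centroid.
nearest_centroid <- function(x) {
  d2 <- rowSums((centroids - matrix(x, nrow(centroids), ncol(centroids),
                                    byrow = TRUE))^2)
  rownames(centroids)[which.min(d2)]
}

## Apply to every row; the training error is optimistic, but illustrates the rule.
pred <- apply(X, 1, nearest_centroid)
mean(pred != y)
```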

Nearest centroid rule (continued)

This rule is described completely by the functions
$$h_{ij}(x) = \frac{d(x, \mu_j)}{d(x, \mu_i)},$$
with $f(x)$ being any $i$ such that $h_{ij}(x) \ge 1$ for all $j$. This rule just classifies points to the nearest $\mu_j$.

If $d(x, y) = \|x - y\|_2$, then the natural centroid is the group mean
$$\mu_j = \frac{1}{n_j} \sum_{l=1}^{n_j} X_{j,l}.$$
But if the covariance matrix of our data is not $I$, this rule ignores structure in the data...

Why we should use covariance

[Scatterplot omitted.]

Choice of distance

Often, there is some background model for our data that is equivalent to a given procedure. For this nearest centroid rule, using Euclidean distance effectively assumes that within each set of points $X_j$, the rows are multivariate Gaussian with covariance matrix proportional to $I$. It also implicitly assumes that the $n_j$'s are roughly equal. For instance, if one $n_j$ were very small, the fact that a data point is close to that $\mu_j$ does not necessarily mean we should give it label $j$: there might be a huge number of points with label $i$ near that $\mu_j$.
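To see what ignoring that structure costs, here is a small hedged illustration (the covariance matrix and the two test points are made up for the example): two points equally far from the centroid in Euclidean distance can be very differently far once the covariance is taken into account.

```r
## A toy 2-d example: strongly correlated features.
Sigma <- matrix(c(1, 0.9,
                  0.9, 1), 2, 2)
mu <- c(0, 0)
x  <- c(1, 1)    # lies along the main axis of the covariance
z  <- c(1, -1)   # lies across it

## Euclidean distance treats x and z as equally far from mu.
sqrt(sum((x - mu)^2))
sqrt(sum((z - mu)^2))

## Mahalanobis distance accounts for the covariance structure.
sqrt(mahalanobis(x, mu, Sigma))   # small: consistent with Sigma
sqrt(mahalanobis(z, mu, Sigma))   # large: an unusual direction under Sigma
```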

Gaussian discriminant functions

Suppose each group with label $j$ has its own mean $\mu_j$ and covariance matrix $\Sigma_j$, as well as proportion $\pi_j$. The Gaussian discriminant functions are defined as
$$h_{ij}(x) = h_i(x) - h_j(x), \qquad h_i(x) = \log \pi_i - \tfrac{1}{2}\log|\Sigma_i| - \tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i).$$
The first term weights the prior probability; the second two terms are $\log \phi_{\mu_i, \Sigma_i}(x) = \log L(\mu_i, \Sigma_i \mid x)$, up to an additive constant. Ignoring the first two terms, the $h_i$'s are essentially within-group Mahalanobis distances...

The classifier assigns $x$ the label $i$ if $h_i(x) \ge h_j(x)$ for all $j$; that is,
$$f(x) = \mathrm{argmax}_{1 \le i \le k}\, h_i(x).$$
This is equivalent to a Bayesian rule. We'll see more Bayesian rules when we talk about naïve Bayes... When all the $\Sigma_i$'s and $\pi_i$'s are identical, the classifier is just the nearest centroid rule using Mahalanobis distance instead of Euclidean distance.

Estimating discriminant functions

In practice, we have to estimate $\pi_j$, $\mu_j$, and $\Sigma_j$. Obvious estimates are
$$\hat\pi_j = \frac{n_j}{\sum_{j=1}^k n_j}, \qquad \hat\mu_j = \frac{1}{n_j}\sum_{l=1}^{n_j} X_{j,l}, \qquad \hat\Sigma_j = \frac{1}{n_j - 1}\,(X_j - \mathbf{1}\hat\mu_j^T)^T (X_j - \mathbf{1}\hat\mu_j^T).$$
If we assume the covariance matrix is the same within groups, then we might also form the pooled estimate
$$\hat\Sigma_P = \frac{\sum_{j=1}^k (n_j - 1)\,\hat\Sigma_j}{\sum_{j=1}^k (n_j - 1)}.$$
If we use the pooled estimate $\hat\Sigma_j = \hat\Sigma_P$ and plug these into the Gaussian discriminants, the functions $h_{ij}(x)$ are linear (or affine) functions of $x$. This is called Linear Discriminant Analysis (LDA). Not to be confused with the other LDA (Latent Dirichlet Allocation)...
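A hand-rolled sketch of these plug-in estimates in R, for illustration only (the class proportions, centroids, and pooled covariance are computed as above; in practice one would use MASS::lda):

```r
## Plug-in estimates for the Gaussian discriminant functions on iris.
X <- as.matrix(iris[, 1:4]); y <- iris$Species
groups <- split(as.data.frame(X), y)

n_j   <- sapply(groups, nrow)
pi_j  <- n_j / sum(n_j)              # estimated class proportions
mu_j  <- lapply(groups, colMeans)    # estimated class centroids
Sig_j <- lapply(groups, cov)         # per-class covariance estimates

## Pooled covariance estimate (assumes a common within-group covariance).
Sig_P <- Reduce(`+`, Map(function(S, n) (n - 1) * S, Sig_j, n_j)) /
         (sum(n_j) - length(n_j))

## Discriminant score h_i(x) with the pooled covariance (LDA).
## The log-determinant term is the same for every class once we pool, so it is omitted.
h <- function(x, i) {
  log(pi_j[i]) - 0.5 * mahalanobis(x, mu_j[[i]], Sig_P)
}

## Classify one observation: argmax over classes.
x0 <- X[1, ]
scores <- sapply(seq_along(pi_j), function(i) h(x0, i))
names(pi_j)[which.max(scores)]
```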

[Figure: Linear Discriminant Analysis using (petal.width, petal.length).]
[Figure: Linear Discriminant Analysis using (sepal.width, sepal.length).]

Quadratic Discriminant Analysis

If we do not use the pooled estimate, but instead plug the per-group estimates $\hat\Sigma_j$ into the Gaussian discriminants, the functions $h_{ij}(x)$ are quadratic functions of $x$. This is called Quadratic Discriminant Analysis (QDA).

[Figure: Quadratic Discriminant Analysis using (petal.width, petal.length).]
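In practice both classifiers are available in the MASS package; a short sketch on the same two iris variables as the plots above (training-data predictions only):

```r
library(MASS)

## LDA and QDA on two of the iris features, as in the plots above.
fit_lda <- lda(Species ~ Sepal.Width + Sepal.Length, data = iris)
fit_qda <- qda(Species ~ Sepal.Width + Sepal.Length, data = iris)

## Predicted classes and posterior probabilities for the training data.
pred_lda <- predict(fit_lda)   # $class, $posterior, $x (discriminant scores)
pred_qda <- predict(fit_qda)   # $class, $posterior

## Training confusion tables.
table(actual = iris$Species, predicted = pred_lda$class)
table(actual = iris$Species, predicted = pred_qda$class)
```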

[Figure: Quadratic Discriminant Analysis using (sepal.width, sepal.length).]
[Figure: Motivation for Fisher's rule.]

Fisher's discriminant function

Fisher proposed to classify using a linear rule. He first decomposed the total sum of squares (with $\bar\mu$ the overall mean) as
$$(X - \mathbf{1}\bar\mu^T)^T (X - \mathbf{1}\bar\mu^T) = \widehat{SS}_B + \widehat{SS}_W.$$
Then, he proposed
$$v = \mathrm{argmax}_{v \,:\, v^T \widehat{SS}_W v = 1}\; v^T \widehat{SS}_B v.$$
Having found $v$, form $X_j v$ and the centroids $\eta_j = \mathrm{mean}(X_j v)$. In the two-class problem ($k = 2$), this is the same as LDA.

Fisher's discriminant functions (continued)

The direction $v_1$ is an eigenvector of some matrix, and there are others, up to $k - 2$ more. Suppose we find all $k - 1$ vectors and form $X_j V^T$, each one an $n_j \times (k - 1)$ matrix with centroid $\eta_j \in \mathbb{R}^{k-1}$. The matrix $V$ determines a map from $\mathbb{R}^p$ to $\mathbb{R}^{k-1}$, so given a new data point we can compute $Vx \in \mathbb{R}^{k-1}$. This gives rise to the classifier
$$f(x) = \text{nearest centroid}(Vx).$$
This is LDA (assuming $\pi_j = 1/k$)...
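A hedged sketch of Fisher's construction by hand: the constrained maximization above is solved by leading eigenvectors of $\widehat{SS}_W^{-1}\widehat{SS}_B$, so the directions can be computed directly. This is illustrative and not the exact computation MASS::lda performs.

```r
## Fisher's discriminant directions for iris, computed by hand.
X <- as.matrix(iris[, 1:4]); y <- iris$Species
k <- nlevels(y); p <- ncol(X)
mu_all <- colMeans(X)

SS_W <- matrix(0, p, p)   # within-group sum of squares
SS_B <- matrix(0, p, p)   # between-group sum of squares
for (lev in levels(y)) {
  Xg   <- X[y == lev, , drop = FALSE]
  mu_g <- colMeans(Xg)
  SS_W <- SS_W + crossprod(sweep(Xg, 2, mu_g))
  SS_B <- SS_B + nrow(Xg) * tcrossprod(mu_g - mu_all)
}

## Up to k - 1 discriminant directions: eigenvectors of SS_W^{-1} SS_B.
eig <- eigen(solve(SS_W) %*% SS_B)
V   <- Re(eig$vectors[, 1:(k - 1)])

## Project the data; classification is nearest centroid in this (k - 1)-d space.
scores <- X %*% V
head(scores)
```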

Discriminant models in general

A discriminant model is generally a model that estimates
$$P(Y = j \mid x), \qquad 1 \le j \le k.$$
That is, given that the features I observe are $x$, the probability that I think this label is $j$... LDA and QDA are actually generative models, since they specify $P(X = x \mid Y = j)$. There are lots of discriminant models...

Logistic regression

The logistic regression model is ubiquitous in binary classification (two-class) problems. The model is
$$P(Y = 1 \mid x) = \frac{e^{x^T\beta}}{1 + e^{x^T\beta}} = \pi(\beta, x).$$

[Figure: Logistic regression, setosa vs. virginica and versicolor, using (sepal.width, sepal.length).]

Software that fits a logistic regression model produces an estimate of $\beta$ based on a data matrix $X_{n \times p}$ and binary labels $Y_{n \times 1} \in \{0, 1\}^n$. It fits the model by minimizing what we call the deviance
$$\mathrm{DEV}(\beta) = -2 \sum_{i=1}^n \bigl( Y_i \log \pi(\beta, X_i) + (1 - Y_i) \log(1 - \pi(\beta, X_i)) \bigr).$$
While not immediately obvious, this is a convex minimization problem, hence fairly easy to solve. Unlike trees, the convexity yields a globally optimal solution.
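A sketch of fitting this with R's glm; the choice of virginica versus the other two species, and of the two sepal features, simply mirrors one of the plots:

```r
## Logistic regression: virginica vs. the other two species.
iris_bin <- transform(iris, is_virginica = as.integer(Species == "virginica"))

fit <- glm(is_virginica ~ Sepal.Width + Sepal.Length,
           data = iris_bin, family = binomial)

coef(fit)       # the estimate of beta (including an intercept)
deviance(fit)   # residual deviance: DEV evaluated at the fitted beta
pi_hat <- predict(fit, type = "response")   # fitted P(Y = 1 | x)
```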

[Figure: Logistic regression, virginica vs. setosa and versicolor, using (sepal.width, sepal.length).]
[Figure: Logistic regression, versicolor vs. setosa and virginica, using (sepal.width, sepal.length).]

Logistic regression (continued)

Logistic regression produces an estimate of $P(Y = 1 \mid x) = \hat\pi(x)$. Typically, we classify as 1 if $\hat\pi(x) > 0.5$. This yields a 2 × 2 confusion matrix:

            Predicted: 0   Predicted: 1
Actual: 0   TN             FP
Actual: 1   FN             TP

From the 2 × 2 confusion matrix, we can compute sensitivity, specificity, etc.

However, we could choose the threshold differently, perhaps guided by estimates of the prior probabilities of 0's and 1's. Each threshold $0 \le t \le 1$ then yields its own 2 × 2 confusion matrix:

            Predicted: 0   Predicted: 1
Actual: 0   TN(t)          FP(t)
Actual: 1   FN(t)          TP(t)
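Continuing the glm sketch above (reusing iris_bin and pi_hat from it), a small helper that builds the thresholded confusion matrix and reads off sensitivity and specificity:

```r
## Confusion matrix at an arbitrary threshold t.
confusion <- function(y, pi_hat, t = 0.5) {
  pred <- as.integer(pi_hat > t)
  table(actual = y, predicted = factor(pred, levels = c(0, 1)))
}

cm <- confusion(iris_bin$is_virginica, pi_hat, t = 0.5)
cm

## Sensitivity = TP / (TP + FN), Specificity = TN / (TN + FP).
sensitivity <- cm["1", "1"] / sum(cm["1", ])
specificity <- cm["0", "0"] / sum(cm["0", ])
```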

ROC curve

Generally speaking, we prefer classifiers that are both highly sensitive and highly specific. These confusion matrices can be summarized using an ROC (Receiver Operating Characteristic) curve. This is a plot of the curve
$$\bigl(1 - \mathrm{Specificity}(t),\ \mathrm{Sensitivity}(t)\bigr), \qquad 0 \le t \le 1.$$
Often, Specificity is referred to as TNR (True Negative Rate) and $1 - \mathrm{Specificity}(t)$ as FPR (False Positive Rate); Sensitivity is referred to as TPR (True Positive Rate) and $1 - \mathrm{Sensitivity}(t)$ as FNR (False Negative Rate).

[Figure: an ROC curve (True Positive Rate vs. False Positive Rate), with the random-guessing diagonal and the point corresponding to t = 0.5 marked.]

Points in the upper left of the ROC plot are good. Any point on the diagonal line represents a classifier that is guessing randomly. A tree classifier, as we've discussed, seems to correspond to only one point in the ROC plot; but one can estimate probabilities based on the class frequencies in the terminal nodes.

AUC: Area under the ROC curve

A common numeric summary of the ROC curve is the AUC, the area under the ROC curve. It can be interpreted as an estimate of the probability that the classifier will give a randomly chosen positive instance a higher score than a randomly chosen negative instance. Its maximum value is 1; for a random guesser, the AUC is 0.5.
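A hedged sketch of tracing the ROC curve and estimating the AUC directly from the fitted probabilities of the earlier glm sketch; packages such as ROCR or pROC provide the same functionality:

```r
## ROC curve by sweeping the threshold over the fitted probabilities.
roc_points <- function(y, score) {
  ts  <- sort(unique(c(0, score, 1)), decreasing = TRUE)
  tpr <- sapply(ts, function(t) mean(score[y == 1] >= t))  # sensitivity
  fpr <- sapply(ts, function(t) mean(score[y == 0] >= t))  # 1 - specificity
  data.frame(threshold = ts, fpr = fpr, tpr = tpr)
}

roc <- roc_points(iris_bin$is_virginica, pi_hat)
plot(roc$fpr, roc$tpr, type = "l",
     xlab = "False Positive Rate", ylab = "True Positive Rate")
abline(0, 1, lty = 2)   # random guessing

## AUC as P(score of a random positive > score of a random negative),
## estimated by pairwise comparisons (ties ignored).
auc <- mean(outer(pi_hat[iris_bin$is_virginica == 1],
                  pi_hat[iris_bin$is_virginica == 0], ">"))
auc
```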

[Figure: ROC curves for logistic regression, rpart, and LDA (True Positive Rate vs. False Positive Rate).]

Part II

Overview

General goals of data mining. Datatypes. Preprocessing & dimension reduction. Distances. Multidimensional scaling. Multidimensional arrays. Decision trees. Performance measures for classifiers.

General goals

Definition: what is and isn't data mining. Different types of problems: unsupervised, supervised, and semi-supervised problems.

Datatypes

Continuous, discrete, etc. How data is represented in R. Descriptive statistics for different datatypes. General characteristics: Is it observational or experimental? Is it very noisy? Is there spatial or temporal structure?

Preprocessing

General tasks: aggregation, transformation, discretization. Feature extraction: wavelet / FFT transform. Another type of feature extraction: dimension reduction.

Dimension reduction

PCA as a dimension reduction tool. PCA in terms of the SVD (a short sketch follows below). PCA loadings and scores.

Graphical summaries

Various plots of univariate continuous data: stem-and-leaf, histogram, density (using a kernel estimate), ECDF, quantile plots. Boxplot. Pairs plot. Correlation / similarity matrix (cases or features).

Multidimensional scaling

Relation between distances and similarities. Graphical (Euclidean) representation of the data that is closest to the original dissimilarity. Relation to PCA when the similarity is Euclidean.

Multidimensional arrays

Data cubes. Standard types of operations on data cubes. A few examples in R.
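As promised above, a brief sketch of PCA via the SVD on the iris features (centering is essential; whether to also scale the columns is a separate choice not shown here):

```r
## PCA of the iris features via the SVD of the centered data matrix.
X  <- scale(as.matrix(iris[, 1:4]), center = TRUE, scale = FALSE)
sv <- svd(X)

scores   <- sv$u %*% diag(sv$d)   # principal component scores (n x p)
loadings <- sv$v                  # loadings: directions in feature space

## Proportion of variance explained by each component.
var_explained <- sv$d^2 / sum(sv$d^2)
round(var_explained, 3)

## The loadings should agree, up to sign, with prcomp(iris[, 1:4])$rotation.
```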

Decision trees

Definition of a classifier. Specific form of a decision tree classifier. Applying the decision tree model. Fitting a decision tree using Hunt's algorithm. Various measures of impurity: Gini, entropy, misclassification rate (a small sketch follows below). Pre- and post-pruning to simplify the tree.

Performance measures

Sensitivity, specificity, true/false negatives, true/false positives. Confusion matrix. Using a cost matrix to evaluate a classifier. Linear / Quadratic Discriminant Analysis; logistic regression. ROC curve.
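And a small companion sketch for the impurity measures listed under decision trees, computing Gini, entropy, and misclassification rate from the labels at a node:

```r
## Node impurity measures used when growing a decision tree.
impurity <- function(labels) {
  p <- prop.table(table(labels))   # class proportions at the node
  c(gini     = 1 - sum(p^2),
    entropy  = -sum(ifelse(p > 0, p * log2(p), 0)),
    misclass = 1 - max(p))
}

## A pure node has impurity 0; a 50/50 node is maximally impure.
impurity(c("a", "a", "a", "a"))
impurity(c("a", "a", "b", "b"))
```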