Lecture 4 Discriminant Analysis, k-nearest Neighbors


Lecture 4 Discriminant Analysis, k-Nearest Neighbors
Fredrik Lindsten
Division of Systems and Control, Department of Information Technology, Uppsala University.
Email: fredrik.lindsten@it.uu.se

Summary of Lecture 3 (I/VI)
The classification problem amounts to modeling the relationship between the input X and a qualitative output Y, i.e., the output belongs to one out of K distinct classes.
A classifier is a prediction model Ŷ = Ĝ(X) that maps any input X into a predicted class Ŷ ∈ {1, ..., K}.
The Bayes classifier predicts each input as belonging to the most likely class, according to the conditional probabilities Pr(Y = k | X) for k ∈ {1, ..., K}. The Bayes classifier is optimal w.r.t. the misclassification error.

Summary of Lecture 3 (II/VI)
For binary (two-class) classification, Y ∈ {0, 1}, the logistic regression model is
    q(x) := Pr(Y = 1 | X) = e^(β^T x) / (1 + e^(β^T x)).
The model parameters β are found by maximum likelihood, i.e. by (numerically) maximizing the log-likelihood function
    log l(β) = Σ_{i=1}^n [ y_i β^T x_i − log(1 + e^(β^T x_i)) ].
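As a concrete illustration of the maximum-likelihood fit above, here is a minimal R sketch (not part of the lecture): it simulates binary data and fits the logistic regression model with glm(). The variable names and parameter values are made up for the example.

    # Minimal sketch (not from the slides): fit logistic regression by maximum
    # likelihood with glm(). Data and true parameters are simulated.
    set.seed(1)
    n <- 200
    x <- matrix(rnorm(2 * n), ncol = 2)                  # two input variables
    beta_true <- c(-0.5, 1.5, -1.0)                      # hypothetical intercept and slopes
    p <- 1 / (1 + exp(-(cbind(1, x) %*% beta_true)))     # Pr(Y = 1 | X = x)
    y <- rbinom(n, size = 1, prob = p)                   # binary output

    fit <- glm(y ~ x, family = binomial)                 # numerical maximization of log l(beta)
    q_hat <- predict(fit, type = "response")             # fitted q(x) = Pr(Y = 1 | X = x)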

Summary of Lecture 3 (III/VI)
Approximating the Bayes classifier, we get the prediction model
    Ĝ(X) = 1 if q(X) > 0.5 (equivalently, β^T X > 0), and 0 otherwise.
This attempts to minimize the total misclassification error. More generally, we can use
    Ĝ(X) = 1 if q(X) > r, and 0 otherwise,
where 0 ≤ r ≤ 1 is a user-chosen threshold.

Summary of Lecture 3 (IV/VI)
Confusion matrix:

                            Predicted condition
                            Ŷ = 0    Ŷ = 1    Total
    True       Y = 0        TN       FP       N
    condition  Y = 1        FN       TP       P
               Total        N*       P*

For the classifier Ĝ(X) = I(q(X) > r) the numbers in the confusion matrix depend on the threshold r.
Decreasing r: TN, FN decrease and FP, TP increase.
Increasing r: TN, FN increase and FP, TP decrease.
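A confusion matrix like the one above can be computed directly with R's table(); this small sketch reuses the simulated labels y and fitted probabilities q_hat from the earlier logistic regression sketch (assumed names).

    # Sketch: confusion matrix for the thresholded classifier G(x) = I(q(x) > r).
    r <- 0.5
    y_hat <- as.integer(q_hat > r)
    table(True = y, Predicted = y_hat)    # rows: Y = 0/1, columns: Yhat = 0/1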

Summary of Lecture 3 (V/VI)
Some commonly used derived performance measures are:
True positive rate:  TPR = TP / P = TP / (FN + TP) ∈ [0, 1]
False positive rate: FPR = FP / N = FP / (FP + TN) ∈ [0, 1]
Precision:           Prec = TP / P* = TP / (FP + TP) ∈ [0, 1]

Summary of Lecture 3 (VI/VI)
Figure: ROC curve for logistic regression (true positive rate vs. false positive rate).
ROC: plot of TPR vs. FPR as r ranges from 0 to 1.¹
Area Under Curve (AUC): condensed performance measure for the classifier, taking all possible thresholds into account.
¹ Possible to draw ROC curves based on other tuning parameters as well.
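The ROC curve and AUC can be traced by sweeping the threshold r. The sketch below (again using the assumed q_hat and y from the earlier example) computes TPR and FPR on a grid of thresholds and approximates the AUC with the trapezoidal rule.

    # Sketch: ROC curve and AUC by sweeping the threshold r from 0 to 1.
    rs  <- seq(0, 1, by = 0.01)
    tpr <- sapply(rs, function(r) sum(q_hat > r & y == 1) / sum(y == 1))  # TP / P
    fpr <- sapply(rs, function(r) sum(q_hat > r & y == 0) / sum(y == 0))  # FP / N
    plot(fpr, tpr, type = "l",
         xlab = "False positive rate", ylab = "True positive rate")
    o   <- order(fpr)                                                     # sort by FPR
    auc <- sum(diff(fpr[o]) * (head(tpr[o], -1) + tail(tpr[o], -1)) / 2)  # trapezoidal rule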

Contents Lecture 4
1. Linear Discriminant Analysis
2. Quadratic Discriminant Analysis
3. A nonparametric classifier: k-Nearest Neighbors

Bayes' theorem
Figure: (possibly) Thomas Bayes, c. 1701–1761.
Bayes' theorem describes a fundamental relationship between conditional and marginal probabilities:
    p(z_1 | z_2) = p(z_2 | z_1) p(z_1) / p(z_2) = p(z_2 | z_1) p(z_1) / ∫ p(z_2 | z_1) p(z_1) dz_1

Bayes' theorem for classification
Recall the Bayes classifier, based on the conditional probabilities Pr(Y = k | X = x). Let
    π_k := Pr(Y = k) denote the prior (marginal) probability of class k ∈ {1, ..., K}, and
    f_k(x) := p(x | Y = k) denote the probability density of X for an observation that comes from the kth class.
The Bayes classifier can then be expressed using Bayes' theorem:
    Pr(Y = k | X = x) = f_k(x) π_k / Σ_{j=1}^K f_j(x) π_j.
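To make the formula concrete, here is a tiny numerical R sketch (made-up numbers, a one-dimensional input, and two classes with Gaussian class-conditional densities) that evaluates the posterior class probabilities with Bayes' theorem.

    # Sketch: posterior class probabilities from priors and class densities.
    x    <- 1.2
    pi_k <- c(0.7, 0.3)                        # priors Pr(Y = 1), Pr(Y = 2) (made up)
    f_k  <- c(dnorm(x, mean = 0, sd = 1),      # f_1(x)
              dnorm(x, mean = 2, sd = 1))      # f_2(x)
    post <- pi_k * f_k / sum(pi_k * f_k)       # Bayes' theorem
    post                                       # Pr(Y = k | X = x), k = 1, 2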

Multivariate Gaussian density
The p-dimensional Gaussian (normal) probability density function with mean vector µ and covariance matrix Σ is
    N(x | µ, Σ) = 1 / ( (2π)^(p/2) |Σ|^(1/2) ) · exp( −(1/2) (x − µ)^T Σ^(−1) (x − µ) ),
where µ is a p × 1 vector and Σ is a p × p positive definite matrix.
Let X = (X_1, ..., X_p)^T ∼ N(µ, Σ). Then
    µ_j is the mean of X_j,
    Σ_jj is the variance of X_j, and
    Σ_ij (i ≠ j) is the covariance between X_i and X_j.
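The density can be evaluated directly from this formula. Below is a small base-R sketch; the function name dmvnorm_manual is made up for illustration (the mvtnorm package provides an equivalent dmvnorm).

    # Sketch: evaluate the p-dimensional Gaussian density N(x | mu, Sigma)
    # directly from the formula above.
    dmvnorm_manual <- function(x, mu, Sigma) {
      p <- length(mu)
      d <- x - mu
      drop(exp(-0.5 * t(d) %*% solve(Sigma) %*% d) / sqrt((2 * pi)^p * det(Sigma)))
    }
    dmvnorm_manual(c(0, 0), mu = c(0, 0), Sigma = diag(2))   # equals 1/(2*pi), about 0.159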

Multivariate Gaussian density
Figure: surface plots of the PDF of two bivariate Gaussian densities over (x_1, x_2), for two different covariance matrices Σ (the matrix entries are partially lost in the transcription).

LDA, summary
The LDA classifier assigns a test input X = x to the class k for which
    δ_k(x) = x^T Σ^(−1) µ_k − (1/2) µ_k^T Σ^(−1) µ_k + log π_k
is largest, where
    π_k = n_k / n,                          k = 1, ..., K,
    µ_k = (1/n_k) Σ_{i: y_i = k} x_i,       k = 1, ..., K,
    Σ   = 1/(n − K) Σ_{k=1}^K Σ_{i: y_i = k} (x_i − µ_k)(x_i − µ_k)^T.
δ_k(x) is a linear function of x, and LDA is therefore a linear classifier (its decision boundaries are linear).
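For readers who want to see these formulas in code, a direct R implementation of the estimates and the discriminant function δ_k(x) might look like the sketch below. The function and variable names are made up; MASS::lda is the standard implementation.

    # Sketch: LDA estimates and discriminant functions, for a numeric input
    # matrix X (n x p) and class labels y taking values in 1..K (assumed).
    lda_fit <- function(X, y) {
      K <- max(y); n <- nrow(X); p <- ncol(X)
      pi_hat <- tabulate(y, K) / n                                   # class proportions
      mu_hat <- t(sapply(1:K, function(k) colMeans(X[y == k, , drop = FALSE])))
      S <- matrix(0, p, p)
      for (k in 1:K) {
        Xc <- scale(X[y == k, , drop = FALSE], center = mu_hat[k, ], scale = FALSE)
        S  <- S + t(Xc) %*% Xc
      }
      Sigma_hat <- S / (n - K)                                       # pooled covariance
      list(pi = pi_hat, mu = mu_hat, Sigma = Sigma_hat)
    }

    lda_predict <- function(fit, x) {
      Sinv  <- solve(fit$Sigma)
      delta <- sapply(seq_along(fit$pi), function(k) {
        mu_k <- fit$mu[k, ]
        drop(t(x) %*% Sinv %*% mu_k - 0.5 * t(mu_k) %*% Sinv %*% mu_k + log(fit$pi[k]))
      })
      which.max(delta)                                               # predicted class
    }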

ex) LDA decision boundary
Illustration of LDA decision boundary: the level curves of two Gaussian PDFs with the same covariance matrix intersect along a straight line, δ_1(x) = δ_2(x).
Figure: level curves of the two Gaussian PDFs in the (x_1, x_2) plane, with a straight decision boundary between them.

ex) Simple spam filter
We will use LDA to construct a simple spam filter:
Output: Y ∈ {spam, ham}
Input: X = 57-dimensional vector of features extracted from the e-mail:
- Frequencies of 48 predefined words (make, address, all, ...)
- Frequencies of 6 predefined characters (;, (, [, !, $, #)
- Average length of uninterrupted sequences of capital letters
- Length of longest uninterrupted sequence of capital letters
- Total number of capital letters in the e-mail
The dataset consists of 4,601 e-mails classified as either spam or ham (split into 75% training and 25% testing).
UCI Machine Learning Repository, Spambase Dataset (https://archive.ics.uci.edu/ml/datasets/spambase)

ex) Simple spam filter
> lda.fit = lda(y ~ ., data = dataset, subset = -test)
LDA confusion matrix (test data):

Table: Threshold r = 0.5
              Ŷ = 0    Ŷ = 1
    Y = 0     667      23
    Y = 1     102      358

Table: Threshold r = 0.9
              Ŷ = 0    Ŷ = 1
    Y = 0     687      3
    Y = 1     237      223
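Predictions at a threshold other than 0.5 can be obtained from the posterior probabilities returned by predict. The sketch below assumes the same (hypothetical) dataset, a test index vector named test, and 0/1 coding of y, as in the call above.

    # Sketch: thresholding the LDA posterior at r = 0.9 instead of the default 0.5.
    pred  <- predict(lda.fit, newdata = dataset[test, ])  # 'test' = test-set row indices (assumed)
    q_hat <- pred$posterior[, "1"]                        # Pr(Y = 1 | X), i.e. Pr(spam | X)
    y_hat <- as.integer(q_hat > 0.9)
    table(True = dataset$y[test], Predicted = y_hat)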

ex) Simple spam filter
A linear classifier uses a linear decision boundary to separate the two classes as much as possible. In R^p this is a (p − 1)-dimensional hyperplane.
How can we visualize the classifier? One way: project the (labeled) test inputs onto the normal of the decision boundary.
Figure: histograms of the projections onto the normal of the decision boundary, shown separately for the two classes (Y = 0 and Y = 1).

ex) Simple spam filter

    Feature      Count    Input (X)
    make         0        X_1 = 0
    address      0        X_2 = 0
    all          1        X_3 = 0.48
    ...
    char ;       1        X_49 = 0.7
    char (       3        X_50 = 0.21
    char [       0        X_51 = 0
    char !       2        X_52 = 0.14
    ...
    #Capitals    4        X_57 = 4

Pr(ham | X) = 96.4% (according to the LDA model)

Gmail's spam filter
Official Gmail blog, July 9, 2015:
"... the spam filter now uses an artificial neural network to detect and block the especially sneaky spam, the kind that could actually pass for wanted mail."
Sri Harsha Somanchi, Product Manager

Quadratic discriminant analysis
Do we have to assume a common covariance matrix? No! Estimating a separate covariance matrix for each class leads to an alternative method, Quadratic Discriminant Analysis (QDA).
Whether we should choose LDA or QDA is a question of the bias-variance trade-off, i.e. the risk of over- or underfitting. Compared to LDA, QDA...
- has more parameters,
- is more flexible (lower bias),
- has a higher risk of overfitting (larger variance).
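In R, QDA is available alongside LDA in the MASS package. A sketch, reusing the same hypothetical spam-filter variable names as before:

    # Sketch: QDA on the spam data, estimating one covariance matrix per class.
    library(MASS)
    qda.fit  <- qda(y ~ ., data = dataset, subset = -test)
    qda.pred <- predict(qda.fit, newdata = dataset[test, ])
    mean(qda.pred$class != dataset$y[test])   # test misclassification error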

ex) QDA decision boundary
Illustration of QDA decision boundary: the level curves of two Gaussian PDFs with different covariance matrices intersect along a curved (quadratic) line.
Figure: level curves of the two Gaussian PDFs in the (x_1, x_2) plane, with a curved decision boundary between them.
Thus, QDA is not a linear classifier.

Parametric and nonparametric models
So far we have looked at a few parametric models: linear regression, logistic regression, LDA and QDA, all of which are parametrized by a fixed-dimensional parameter.
Nonparametric models instead allow the flexibility of the model to grow with the amount of available data. They
- can be very flexible (low bias),
- can suffer from overfitting (high variance),
- can be compute- and memory-intensive.
As always, the bias-variance trade-off is key also when working with nonparametric models.

k-nn The k-nearest neighbors (k-nn) classifier is a simple non-parametric method. Given training data T = {(x i, y i )} n i=1, for a test input X = x, 1. Identify the k training inputs x i nearest to x, represented by the set N = {i : x i is in the k-neighborhood of x}. 2. Classify x according to a majority vote within the neighborhood N, i.e. assign x to class j for which i N I(y i = j) is largest (ties are often handled by a coin flip). 22 / 29 fredrik.lindsten@it.uu.se

ex) k-nn on a toy model We illustrate the k-nn classifier on a synthetic example where the Bayes classifier is known (colored regions). 4 2 Y=1 Y=2 Y=3 X 2-2 -4-6 -4-2 2 4 X 1 23 / 29 fredrik.lindsten@it.uu.se

ex) k-nn on a toy model The decision boundaries for the k-nn classifier are shown as black lines. 4 2 Y=1 Y=2 Y=3 k=1 X 2-2 -4-6 -4-2 2 4 X 1 24 / 29 fredrik.lindsten@it.uu.se

ex) k-nn on a toy model The decision boundaries for the k-nn classifier are shown as black lines. 4 2 Y=1 Y=2 Y=3 k=1 X 2-2 -4-6 -4-2 2 4 X 1 25 / 29 fredrik.lindsten@it.uu.se

ex) k-nn on a toy model The decision boundaries for the k-nn classifier are shown as black lines. 4 2 Y=1 Y=2 Y=3 k=1 X 2-2 -4-6 -4-2 2 4 X 1 26 / 29 fredrik.lindsten@it.uu.se

ex) k-nn on a toy model 4 k=1 4 k=1 4 k=1 Y=1 Y=1 Y=1 Y=2 Y=2 Y=2 2 Y=3 2 Y=3 2 Y=3 X 2 X 2 X 2-2 -2-2 -4-4 -4-6 -4-2 2 4 X 1-6 -4-2 2 4 X 1-6 -4-2 2 4 X 1 The choice of the tuning parameter k controls the model flexibility: Small k small bias, large variance Large k large bias, small variance 27 / 29 fredrik.lindsten@it.uu.se

Error ex) k-nn on a toy model.4.3 Training error Test error Bayes error.2.1.1.1 1 1/k 28 / 29 fredrik.lindsten@it.uu.se

A few concepts to summarize lecture 4
Bayes' theorem: A formula relating conditional and marginal probabilities of two events.
Linear discriminant analysis (LDA): A classifier based on the assumption that the distribution of the input X is multivariate Gaussian for each class, with different means but the same covariance matrix. LDA is a linear classifier.
Quadratic discriminant analysis (QDA): Same as LDA, but with a different covariance matrix for each class. QDA is not a linear classifier.
Nonparametric model: A model whose flexibility is allowed to grow with the amount of available training data.
k-NN classifier: A simple nonparametric classifier based on classifying a test input X = x according to the class labels of the k training samples nearest to x.