Informatics 2B: Learning and Data Lecture 10 Discriminant functions 2. Minimal misclassifications. Decision Boundaries

Overview

Guido Sanguinetti — Informatics 2B: Learning and Data, Lecture 10, 9 March

Today's lecture:
- Posterior probabilities, decision regions and minimising the probability of misclassification
- Gaussians and discriminant functions
- Linear discriminants

Decision Regions

Consider a 1-d feature space (x) and two classes, C_1 and C_2. The feature space is divided into decision regions: x is assigned to C_1 if it lies in region R_1, and to C_2 if it lies in region R_2.

[Figure: decision regions for a 3-class example.]

Misclassification

Consider a 1-d feature space (x) and two classes, C_1 and C_2. Two kinds of misclassification error can occur:
1. assigning x to C_1 when it belongs to C_2 (x is in R_1 when it belongs to C_2);
2. assigning x to C_2 when it belongs to C_1 (x is in R_2 when it belongs to C_1).

Total probability of error:

P(\mathrm{error}) = P(x \in R_2, C_1) + P(x \in R_1, C_2)
                  = P(x \in R_2 \mid C_1)\,P(C_1) + P(x \in R_1 \mid C_2)\,P(C_2)
                  = \int_{R_2} p(x \mid C_1)\,P(C_1)\,dx + \int_{R_1} p(x \mid C_2)\,P(C_2)\,dx

[Figure: p(x|C_1)P(C_1) and p(x|C_2)P(C_2) plotted against x, showing the two contributions to P(error).]

Minimising the probability of misclassification

P(\mathrm{error}) = \int_{R_2} p(x \mid C_1)\,P(C_1)\,dx + \int_{R_1} p(x \mid C_2)\,P(C_2)\,dx

To minimise P(error): for a given x, if p(x|C_1)P(C_1) > p(x|C_2)P(C_2), then the point x should be placed in region R_1. The probability of misclassification is thus minimised by assigning each point to the class with the maximum posterior probability. This justification of the maximum-posterior rule extends to d-dimensional feature vectors and K classes.
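As a concrete check of this rule, the following sketch (the two univariate Gaussian class models and priors are made-up values, not the lecture's example) assigns each point to the class with the larger p(x|C)P(C) and estimates P(error) by numerically integrating the two terms above:

import numpy as np
from scipy.stats import norm

# Assumed (illustrative) class-conditional densities p(x|C1), p(x|C2) and priors
mu1, sigma1, prior1 = 0.0, 1.0, 0.6
mu2, sigma2, prior2 = 2.5, 1.0, 0.4

def joint(x, mu, sigma, prior):
    """p(x|C) P(C) for a univariate Gaussian class model."""
    return norm.pdf(x, loc=mu, scale=sigma) * prior

def classify(x):
    """Assign x to the class with the larger p(x|C)P(C), i.e. the larger posterior."""
    return 1 if joint(x, mu1, sigma1, prior1) > joint(x, mu2, sigma2, prior2) else 2

# Estimate P(error) = int_{R2} p(x|C1)P(C1) dx + int_{R1} p(x|C2)P(C2) dx
# by summing over a fine grid.
xs = np.linspace(-6.0, 9.0, 20001)
dx = xs[1] - xs[0]
in_R1 = np.array([classify(x) == 1 for x in xs])
p_error = (joint(xs, mu1, sigma1, prior1)[~in_R1].sum()
           + joint(xs, mu2, sigma2, prior2)[in_R1].sum()) * dx
print("estimated P(error) =", p_error)

Moving the boundary away from the point where p(x|C_1)P(C_1) = p(x|C_2)P(C_2) can only increase the estimated P(error), which is the point of the argument above.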

Discriminant functions

If we assign x to the class C with the highest posterior probability P(C|x), then the posterior probability, or the log posterior probability, forms a suitable discriminant function:

y_c(x) = \ln P(C \mid x) = \ln p(x \mid C) + \ln P(C) - \ln p(x)

Since \ln p(x) is the same for every class, it can be dropped, leaving y_c(x) = \ln p(x \mid C) + \ln P(C).

Equation of a decision boundary: decision boundaries are defined where two discriminant functions are equal, y_k(x) = y_l(x). Considering discriminant functions changes the emphasis from how individual classes are modelled to how the boundaries between classes are estimated.

Gaussians and discriminant functions

If the discriminant function is the log posterior probability,

y_c(x) = \ln p(x \mid C) + \ln P(C),

then, substituting in the log probability of a Gaussian and dropping constant terms, we obtain

y_c(x) = -\tfrac{1}{2}(x - \mu_c)^T \Sigma_c^{-1} (x - \mu_c) - \tfrac{1}{2}\ln|\Sigma_c| + \ln P(C).

This function is quadratic in x.
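A minimal sketch of this quadratic Gaussian discriminant, assuming the class means, covariance matrices and priors have already been estimated (the 2-d values below are illustrative placeholders, not the lecture's data):

import numpy as np

def log_gaussian_discriminant(x, mu, Sigma, prior):
    """y_c(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - 1/2 ln|Sigma| + ln P(C)."""
    diff = x - mu
    Sigma_inv = np.linalg.inv(Sigma)
    return (-0.5 * diff @ Sigma_inv @ diff
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Illustrative 2-d parameters (mean, covariance, prior) for three classes
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 1.0]]), 1/3),
    (np.array([3.0, 0.0]), np.array([[1.5, 0.0], [0.0, 0.5]]), 1/3),
    (np.array([0.0, 3.0]), np.array([[0.7, 0.0], [0.0, 0.7]]), 1/3),
]

x = np.array([1.0, 1.0])
scores = [log_gaussian_discriminant(x, mu, Sigma, p) for mu, Sigma, p in params]
print("assigned class:", int(np.argmax(scores)))   # index of the largest y_c(x)

Because each class keeps its own Sigma_c, the resulting decision boundaries are in general curved (quadratic).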

Gaussians estimated from training data

[Figures: non-equal-covariance and equal-covariance Gaussians estimated from the training data for the 3-class example; the test data classified under each model; the decision regions for the equal-covariance Gaussians.]

Results

[Tables: confusion matrices of predicted class (A, B, C) against true class on the test data, for the non-equal-covariance and the equal-covariance Gaussian classifiers, each with its fraction of test points classified correctly.]
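The fraction correct quoted with such a table is the sum of the confusion matrix's diagonal divided by the total number of test points. A minimal sketch of how the table and the fraction correct can be computed (the class labels and predictions below are invented for illustration, not the lecture's results):

import numpy as np

labels = ["A", "B", "C"]
# Hypothetical true and predicted classes for a handful of test points
true_class = ["A", "A", "B", "C", "C", "B", "A", "C"]
predicted  = ["A", "B", "B", "C", "A", "B", "A", "C"]

# confusion[i, j] counts points of true class j that were predicted as class i
confusion = np.zeros((3, 3), dtype=int)
for t, p in zip(true_class, predicted):
    confusion[labels.index(p), labels.index(t)] += 1

print(confusion)
fraction_correct = np.trace(confusion) / confusion.sum()
print("fraction correct =", fraction_correct)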

Decision regions: non-equal covariance

[Figure: decision regions for the 3-class example with non-equal-covariance Gaussians (curved boundaries).]

Equal-covariance Gaussians: linear discriminants

If all classes share the same covariance matrix \Sigma, the terms -\tfrac{1}{2}x^T\Sigma^{-1}x and -\tfrac{1}{2}\ln|\Sigma| are the same for every class, so they do not affect which class has the largest discriminant. By dropping terms that are now constant across classes we can simplify the discriminant function to

y_c(x) = (\mu_c^T \Sigma^{-1})\,x - \tfrac{1}{2}\mu_c^T \Sigma^{-1} \mu_c + \ln P(C).

This is a linear function of x. We can define two variables in terms of \mu_c, P(C) and \Sigma:

w_c^T = \mu_c^T \Sigma^{-1}
w_{c0} = -\tfrac{1}{2}\mu_c^T \Sigma^{-1} \mu_c + \ln P(C)

Substituting w_c and w_{c0} into the expression for y_c(x) gives the linear discriminant function

y_c(x) = w_c^T x + w_{c0}.

Linear discriminant: decision boundary for equal-covariance Gaussians

The boundary between classes C_1 and C_2 is the set of points where y_1(x) = y_2(x). In two dimensions the boundary is a line, in three dimensions it is a plane, and in d dimensions it is a hyperplane.

[Figure: two classes C_1 and C_2 in the (x_1, x_2) plane, separated by the line y_1(x) = y_2(x).]

Spherical Gaussians with equal covariance

Spherical Gaussians have a diagonal covariance matrix, with the same variance in each dimension: \Sigma = \sigma^2 I. If we further assume that the prior probabilities of each class are equal, we can write the discriminant function as

y_c(x) = -\frac{\lVert x - \mu_c \rVert^2}{2\sigma^2}.

This assigns a test vector to the class whose mean is closest. In this case the class means may be regarded as class templates or prototypes.
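A minimal sketch of the equal-covariance linear discriminant and its spherical special case, assuming a shared covariance matrix and per-class means and priors (the numbers below are illustrative, not estimated from any data):

import numpy as np

def linear_discriminant_params(mu, Sigma, prior):
    """w_c = Sigma^{-1} mu_c,  w_c0 = -1/2 mu_c^T Sigma^{-1} mu_c + ln P(C)."""
    Sigma_inv = np.linalg.inv(Sigma)
    w = Sigma_inv @ mu
    w0 = -0.5 * mu @ Sigma_inv @ mu + np.log(prior)
    return w, w0

# Illustrative shared covariance and two class means / priors
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
classes = [(np.array([0.0, 0.0]), 0.5), (np.array([2.0, 1.0]), 0.5)]

x = np.array([1.0, 0.5])
scores = []
for mu, prior in classes:
    w, w0 = linear_discriminant_params(mu, Sigma, prior)
    scores.append(w @ x + w0)          # y_c(x) = w_c^T x + w_c0
print("assigned class:", int(np.argmax(scores)))

# Spherical case (Sigma = sigma^2 I, equal priors): assign to the nearest class mean
sigma2 = 1.0
nearest = int(np.argmin([np.sum((x - mu) ** 2) / (2 * sigma2) for mu, _ in classes]))
print("nearest-mean class:", nearest)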

Two-class linear discriminants

For a two-class problem, the log odds can be used as a single discriminant function:

y(x) = \ln p(x \mid C_1) - \ln p(x \mid C_2) + \ln P(C_1) - \ln P(C_2)

The decision boundary is given by y(x) = 0. If the pdfs are Gaussians with a shared covariance matrix, we again have a linear discriminant,

y(x) = w^T x + w_0,

where w and w_0 are functions of \mu_1, \mu_2, \Sigma, P(C_1) and P(C_2).

Let x_a and x_b be two points on the decision boundary. Then:

y(x_a) = 0 = y(x_b)
w^T x_a + w_0 = 0 = w^T x_b + w_0

A little rearranging gives w^T (x_a - x_b) = 0.

Geometry of a two-class linear discriminant

Since x_a - x_b is an arbitrary vector lying in the decision boundary, w is normal to any vector on the hyperplane decision boundary. If x is a point on the hyperplane, then the normal distance from the hyperplane to the origin is given by

l = \frac{w^T x}{\lVert w \rVert} = -\frac{w_0}{\lVert w \rVert}   (using y(x) = 0).

(A small numerical sketch of w, w_0 and this distance follows the summary below.)

Summary

- Obtaining decision boundaries from probability models and a decision rule
- Minimising the probability of misclassification
- Discriminant functions and Gaussian pdfs
- Linear discriminants and decision boundaries
- Next lecture: training discriminants directly
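Returning to the two-class linear discriminant: under the shared-covariance Gaussian assumption, w and w_0 can be obtained by subtracting the two per-class linear discriminants defined earlier. A minimal sketch (with invented parameter values) that also computes the normal distance of the boundary from the origin and checks that w is orthogonal to vectors lying in the boundary:

import numpy as np

# Assumed two-class equal-covariance Gaussian model (illustrative values)
mu1, mu2 = np.array([2.0, 1.0]), np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
P1, P2 = 0.5, 0.5

Sigma_inv = np.linalg.inv(Sigma)
# y(x) = w^T x + w0, obtained by subtracting the two per-class linear discriminants
w = Sigma_inv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Sigma_inv @ mu1 + 0.5 * mu2 @ Sigma_inv @ mu2 + np.log(P1 / P2)

def y(x):
    return w @ x + w0   # decision boundary where y(x) = 0

# Normal distance from the boundary hyperplane to the origin: l = -w0 / ||w||
l = -w0 / np.linalg.norm(w)
print("w =", w, " w0 =", w0, " distance to origin =", l)

# Check that w is normal to the boundary: construct two boundary points x_a, x_b
# and verify that y(x_b) = 0 and w^T (x_a - x_b) = 0 (up to rounding error).
d = np.array([1.0, 0.0])
x_a = -w0 / (w @ d) * d                      # scaled so that y(x_a) = 0
x_b = x_a + np.array([-w[1], w[0]])          # move along the boundary, perpendicular to w
print("y(x_b) =", y(x_b), " w . (x_a - x_b) =", w @ (x_a - x_b))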