Bayes Decision Theory


Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
seungjin@postech.ac.kr

Bayes Decision Theory

Consider a set of feature vectors, {x}, each of which belongs to either class C_1 or class C_2 with prior probability P(C_i). A fundamental question arises: what is the best way to assign an appropriate class label to a data point x?

Decide C_1 if P(C_1) > P(C_2)? Not a good idea: the priors alone carry little information about x.

Bayes decision theory gives an answer to this fundamental question. Bayes decision rule: decide C_1 if P(C_1 | x) > P(C_2 | x).

Question: Does this Bayes decision rule give the minimal probability of misclassification? Yes!

Bayes Rule

$$
P(C_k \mid x) = \frac{\overbrace{P(x \mid C_k)}^{\text{class-conditional density}}\;\overbrace{P(C_k)}^{\text{prior}}}{\underbrace{\sum_j P(x \mid C_j)\, P(C_j)}_{\text{normalization factor}}}.
$$

In practice, we model the class-conditional density P(x | C_k) by a parameterized form.
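As a minimal sketch of this rule (not part of the original slides), the snippet below computes the posterior P(C_k | x) for two classes by weighting each class-conditional density by its prior and dividing by the normalization factor; the 1-D Gaussian likelihoods, their parameters, and the priors are all assumed for illustration.

```python
import numpy as np
from scipy.stats import norm

# Assumed priors and 1-D Gaussian class-conditional densities (illustrative only).
priors = np.array([0.6, 0.4])                      # P(C_1), P(C_2)
likelihoods = [norm(loc=0.0, scale=1.0),           # P(x | C_1)
               norm(loc=2.0, scale=1.0)]           # P(x | C_2)

def posterior(x):
    """Return [P(C_1 | x), P(C_2 | x)] via Bayes' rule."""
    joint = np.array([lik.pdf(x) * p for lik, p in zip(likelihoods, priors)])
    return joint / joint.sum()                     # divide by the normalization factor

x = 1.3
post = posterior(x)
print(post, "-> decide C_1" if post[0] > post[1] else "-> decide C_2")
```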

Decision Boundary: Case 1

[Figure: the weighted class-conditional densities P(x | C_1)P(C_1) and P(x | C_2)P(C_2) plotted over x, with the corresponding decision regions R_1 and R_2.]

Decision Boundary: Case 2

[Figure: another configuration of P(x | C_1)P(C_1) and P(x | C_2)P(C_2) over x, with the resulting decision regions R_1 and R_2.]

Decision Boundaries

Decision boundaries are boundaries between decision regions. The probability of misclassification in the binary classification problem is given by

$$
\begin{aligned}
P(\text{error}) &= P(x \in R_2, C_1) + P(x \in R_1, C_2) \\
&= P(x \in R_2 \mid C_1)\,P(C_1) + P(x \in R_1 \mid C_2)\,P(C_2) \\
&= \int_{R_2} P(x \mid C_1)\,P(C_1)\,dx + \int_{R_1} P(x \mid C_2)\,P(C_2)\,dx.
\end{aligned}
$$

One can observe that if P(x | C_1)P(C_1) > P(x | C_2)P(C_2), we should choose the regions R_1 and R_2 such that x is in R_1, since this gives a smaller contribution to the error.

Alternatively, we consider

$$
\begin{aligned}
P(\text{correct}) &= \sum_{k=1}^{K} P(x \in R_k, C_k) \\
&= \sum_{k=1}^{K} P(x \in R_k \mid C_k)\,P(C_k) \\
&= \sum_{k=1}^{K} \int_{R_k} P(x \mid C_k)\,P(C_k)\,dx.
\end{aligned}
$$

This probability is maximized by choosing {R_k} such that each x is assigned to the class for which the integrand is a maximum.
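The following rough numerical sketch (not from the slides) applies this argmax-of-the-integrand rule to two assumed 1-D Gaussian classes and approximates the two-class P(error) of the previous slide by a Riemann sum; all means, variances, and priors are made up.

```python
import numpy as np
from scipy.stats import norm

# Assumed priors and 1-D Gaussian class-conditional densities (illustrative only).
priors = np.array([0.5, 0.5])
likelihoods = [norm(-1.0, 1.0), norm(1.0, 1.0)]

xs = np.linspace(-10, 10, 20001)
dx = xs[1] - xs[0]
joints = np.stack([lik.pdf(xs) * p for lik, p in zip(likelihoods, priors)])
decide = joints.argmax(axis=0)          # Bayes rule: pick the larger joint P(x|C_k)P(C_k)

# P(error) = ∫_{R_2} P(x|C_1)P(C_1) dx + ∫_{R_1} P(x|C_2)P(C_2) dx (Riemann sum)
p_error = (joints[0, decide == 1].sum() + joints[1, decide == 0].sum()) * dx
print("P(error) ≈", p_error)            # for these symmetric classes, ≈ 1 - Φ(1) ≈ 0.159
```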

Discriminant Functions

Decide C_k if f_k(x) > f_j(x) for all j ≠ k.

Choose f_k(x) = P(C_k | x), or equivalently f_k(x) = P(x | C_k)P(C_k), where the class-independent factor P(x) is dropped.

If f_k(x) is a discriminant function, then g(f_k(x)) for any monotonically increasing function g, a f_k(x) with a > 0, and f_k(x) + b are also eligible discriminant functions. In particular,

$$f_k(x) = \log\big[P(x \mid C_k)\,P(C_k)\big] = \log P(x \mid C_k) + \log P(C_k)$$

is also a discriminant function.
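As a quick illustrative check (with made-up joint scores, not from the slides), the snippet below verifies that a logarithm, a positive scaling, or an additive offset applied to the discriminant values leaves the decision unchanged.

```python
import numpy as np

# Made-up joint values P(x | C_k) P(C_k) for three classes (illustrative only).
scores = np.array([1e-4, 3e-3, 7e-5])

decisions = [int(np.argmax(scores)),              # f_k(x) = P(x|C_k)P(C_k)
             int(np.argmax(np.log(scores))),      # g = log, monotonically increasing
             int(np.argmax(5.0 * scores + 2.0))]  # a f_k(x) + b with a > 0
print(decisions)   # all three transformations pick the same class
```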

Discriminant Functions for Normal Density

Consider discriminant functions f_i(x) = log P(x | C_i) + log P(C_i), and assume P(x | C_i) ~ N(µ_i, Σ_i). Then the discriminant functions have the form

$$
\begin{aligned}
f_i(x) &= \log\left[\frac{1}{(2\pi)^{m/2}\,|\Sigma_i|^{1/2}}\exp\left\{-\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i)\right\}\right] + \log P(C_i) \\
&= -\tfrac{1}{2}(x-\mu_i)^T \Sigma_i^{-1}(x-\mu_i) - \tfrac{m}{2}\log 2\pi - \tfrac{1}{2}\log|\Sigma_i| + \log P(C_i).
\end{aligned}
$$
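A small Python sketch of this Gaussian discriminant follows; the example means, covariances, priors, and the test point are all assumed for illustration.

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """f_i(x) = -1/2 (x-µ)^T Σ^{-1}(x-µ) - m/2 log 2π - 1/2 log|Σ| + log P(C_i)."""
    m = len(mu)
    diff = x - mu
    Sigma_inv = np.linalg.inv(Sigma)
    return (-0.5 * diff @ Sigma_inv @ diff
            - 0.5 * m * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Assumed class parameters (illustrative only).
x = np.array([0.5, 1.0])
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.3], [0.3, 1.0]])]
priors = [0.5, 0.5]

scores = [gaussian_discriminant(x, m_, S, p) for m_, S, p in zip(mus, Sigmas, priors)]
print("decide class", int(np.argmax(scores)) + 1)
```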

Case 1: Σ_i = σ² I

Features are statistically independent with the same variance σ², giving hyperspherical clusters. Then |Σ_i| = σ^{2m} and Σ_i^{-1} = (1/σ²) I, so that, dropping class-independent terms,

$$f_i(x) = -\frac{1}{2\sigma^2}\,\|x - \mu_i\|^2 + \log P(C_i).$$

If the priors P(C_i) are the same for all classes, the discriminant functions become

$$f_i(x) = -\frac{1}{2\sigma^2}\,\|x - \mu_i\|^2,$$

which is a minimum distance classifier.
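A minimal sketch of the resulting minimum distance (nearest-mean) classifier is shown below; the class means and the test point are assumed.

```python
import numpy as np

# Assumed class means, one row per class (illustrative only).
mus = np.array([[0.0, 0.0], [3.0, 1.0], [1.0, 4.0]])

def nearest_mean(x):
    """Assign x to the class whose mean is closest in Euclidean distance."""
    d2 = ((mus - x) ** 2).sum(axis=1)   # squared distances ||x - µ_i||²
    return int(d2.argmin()) + 1         # 1-based class index, as in the slides

print(nearest_mean(np.array([0.8, 0.7])))
```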

Case 1 Leads to Linear Discriminant Functions

In Case 1, the discriminant function f_i(x) can be rewritten as the linear function

$$f_i(x) = w_i^T x + w_{i0},$$

where

$$w_i = \frac{1}{\sigma^2}\,\mu_i, \qquad w_{i0} = -\frac{1}{2\sigma^2}\,\mu_i^T \mu_i + \log P(C_i).$$

Decision boundaries are hyperplanes defined by f_i(x) = f_j(x).
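The following sketch builds these weights w_i and w_{i0} from assumed means, variance, and priors and evaluates the linear discriminant at an assumed test point.

```python
import numpy as np

# Assumed shared variance, class means, and priors (illustrative only).
sigma2 = 1.5
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.7, 0.3]

def linear_weights(mu, prior):
    """w_i = µ_i / σ², w_i0 = -µ_i^T µ_i / (2σ²) + log P(C_i)."""
    w = mu / sigma2
    w0 = -0.5 * mu @ mu / sigma2 + np.log(prior)
    return w, w0

x = np.array([1.0, 0.4])
scores = [w @ x + w0 for w, w0 in (linear_weights(m, p) for m, p in zip(mus, priors))]
print("decide class", int(np.argmax(scores)) + 1)
```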

Case 2: Σ_i = Σ

With a covariance matrix shared across classes, the discriminant functions are given by

$$f_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma^{-1}(x - \mu_i) + \log P(C_i).$$

If the priors P(C_i) are the same for all classes, the discriminant function is simply based on the Mahalanobis distance, (x - µ_i)^T Σ^{-1} (x - µ_i).

Case 2 also leads to linear discriminant functions of the form

$$f_i(x) = w_i^T x + w_{i0},$$

where

$$w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = -\frac{1}{2}\,\mu_i^T \Sigma^{-1}\mu_i + \log P(C_i).$$
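A corresponding sketch for the shared-covariance case (the classical LDA score), again with assumed parameters, might look as follows.

```python
import numpy as np

# Assumed shared covariance, class means, and priors (illustrative only).
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.5, 0.5]

def lda_score(x, mu, prior):
    """f_i(x) = w_i^T x + w_i0 with w_i = Σ^{-1}µ_i, w_i0 = -µ_i^T Σ^{-1}µ_i / 2 + log P(C_i)."""
    w = Sigma_inv @ mu
    w0 = -0.5 * mu @ Sigma_inv @ mu + np.log(prior)
    return w @ x + w0

x = np.array([1.2, 0.3])
print("decide class", int(np.argmax([lda_score(x, m, p) for m, p in zip(mus, priors)])) + 1)
```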

Case 3: Arbitrary Σ_i

In this case, the discriminant functions have the form

$$f_i(x) = x^T W_i x + w_i^T x + w_{i0},$$

where

$$W_i = -\frac{1}{2}\,\Sigma_i^{-1}, \qquad w_i = \Sigma_i^{-1}\mu_i, \qquad w_{i0} = -\frac{1}{2}\,\mu_i^T \Sigma_i^{-1}\mu_i - \frac{1}{2}\log|\Sigma_i| + \log P(C_i).$$

This case leads to a quadratic discriminant function. The decision boundaries are hyperquadrics; they can assume any of the general forms, such as pairs of hyperplanes, hyperspheres, hyperellipsoids, and hyperparaboloids.
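The quadratic discriminant of Case 3 can be sketched in the same way; the per-class parameters below are assumed for illustration.

```python
import numpy as np

def qda_score(x, mu, Sigma, prior):
    """f_i(x) = x^T W_i x + w_i^T x + w_i0 with class-specific Σ_i."""
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    w0 = (-0.5 * mu @ Sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(Sigma))
          + np.log(prior))
    return x @ W @ x + w @ x + w0

# Assumed (mean, covariance, prior) per class and an assumed test point.
x = np.array([0.0, 1.5])
params = [(np.array([0.0, 0.0]), np.eye(2), 0.5),
          (np.array([1.0, 1.0]), np.array([[2.0, 0.0], [0.0, 0.5]]), 0.5)]
print("decide class", int(np.argmax([qda_score(x, *p) for p in params])) + 1)
```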

Loss Function and Expected Loss

Suppose that we are given a set of training data, {(x_i, y_i)}_{i=1}^N. The loss function l(f(x), y) quantifies the loss, or cost, associated with the prediction f(x) when the data point x is actually labeled y.

The expected loss is defined as

$$L(f(x), y) = \mathbb{E}_{p(y \mid x)}\big[l(f(x), y)\big] = \int l(f(x), y)\, p(y \mid x)\, dy.$$

0-1 Loss

The 0-1 (binary) loss function is of the form

$$l(f(x), y) = 1 - \delta_{f(x),\,y} = \begin{cases} 0, & \text{if } f(x) = y \\ 1, & \text{otherwise.} \end{cases}$$

It makes most sense when the hypothesis space is discrete. The expected loss is given by

$$
\begin{aligned}
L(f(x), y) &= \sum_y l(f(x), y)\, p(y \mid x) = \sum_y \big(1 - \delta_{f(x),\,y}\big)\, p(y \mid x) \\
&= \sum_y p(y \mid x) - \sum_y \delta_{f(x),\,y}\, p(y \mid x) = 1 - p(f(x) \mid x).
\end{aligned}
$$

The expected loss is minimized when f(x) is chosen to be the maximum of the posterior distribution p(y | x), i.e., the MAP estimate.
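As a small worked example with an assumed discrete posterior, the expected 0-1 loss of each candidate label equals 1 − p(label | x), so the MAP label minimizes it; the sketch below just evaluates this.

```python
import numpy as np

# Assumed posterior p(y | x) over three labels (illustrative only).
p_y_given_x = np.array([0.1, 0.6, 0.3])

expected_loss = 1.0 - p_y_given_x                # L(f(x)=k) = 1 - p(k | x)
print(expected_loss)                             # [0.9, 0.4, 0.7]
print("MAP label:", int(p_y_given_x.argmax()))   # the MAP label has the smallest expected loss
```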

Squared Loss

The squared error loss function is of the form

$$l(f(x), y) = (y - f(x))^2.$$

It is most appropriate when y lives in a continuous space with a well-defined metric. The expected loss is given by

$$
\begin{aligned}
L(f(x), y) &= \int l(f(x), y)\, p(y \mid x)\, dy = \int (y - f(x))^2\, p(y \mid x)\, dy \\
&= \int y^2\, p(y \mid x)\, dy + f(x)^2 \int p(y \mid x)\, dy - 2 f(x) \int y\, p(y \mid x)\, dy \\
&= f(x)^2 - 2 f(x)\, \mathbb{E}[y \mid x] + \mathbb{E}[y^2 \mid x] \\
&= \big(f(x) - \mathbb{E}[y \mid x]\big)^2 + \mathbb{E}\big[(y - \mathbb{E}[y \mid x])^2 \mid x\big],
\end{aligned}
$$

which is minimized when f(x) = E[y | x].
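The following sketch checks this numerically for an assumed discrete posterior p(y | x): the expected squared loss over a grid of candidate predictions is minimized near E[y | x].

```python
import numpy as np

# Assumed support and posterior p(y | x) (illustrative only).
ys = np.array([0.0, 1.0, 2.0])
p = np.array([0.2, 0.5, 0.3])

fs = np.linspace(-1, 3, 401)                          # grid of candidate predictions f(x)
exp_loss = ((fs[:, None] - ys) ** 2 * p).sum(axis=1)  # E[(y - f)^2 | x] for each candidate
print("argmin f ≈", fs[exp_loss.argmin()], " E[y|x] =", (ys * p).sum())  # both ≈ 1.1
```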