Bayesian Decision Theory


Bayesian Decision Theory
Selim Aksoy
Department of Computer Engineering, Bilkent University
saksoy@cs.bilkent.edu.tr
CS 551, Fall 2017

Bayesian Decision Theory

Bayesian decision theory is a fundamental statistical approach that quantifies the tradeoffs between various decisions using the probabilities and costs that accompany such decisions. First, we will assume that all probabilities are known. Then, we will study the cases where the probabilistic structure is not completely known.

Fish Sorting Example Revisited

The state of nature is a random variable. Define ω as the type of fish we observe (state of nature, class), where ω = ω_1 for sea bass and ω = ω_2 for salmon. P(ω_1) is the a priori probability that the next fish is a sea bass, and P(ω_2) is the a priori probability that the next fish is a salmon.

Prior Probabilities

Prior probabilities reflect our knowledge of how likely each type of fish is to appear before we actually see it. How can we choose P(ω_1) and P(ω_2)? Set P(ω_1) = P(ω_2) if the classes are equiprobable (uniform priors), or use different values depending on the fishing area, the time of year, etc. We assume there are no other types of fish (exclusivity and exhaustivity), so P(ω_1) + P(ω_2) = 1.

Making a Decision

How can we make a decision with only the prior information? Decide ω_1 if P(ω_1) > P(ω_2), otherwise decide ω_2. What is the probability of error for this decision? P(error) = min{P(ω_1), P(ω_2)}.

Class-Conditional Probabilities

Let's try to improve the decision using the lightness measurement x. Let x be a continuous random variable. Define p(x|ω_j) as the class-conditional probability density (the density of x given that the state of nature is ω_j, for j = 1, 2). p(x|ω_1) and p(x|ω_2) describe the difference in lightness between the populations of sea bass and salmon.

Class-Conditional Probabilities

Figure 1: Hypothetical class-conditional probability density functions for two classes.

Posterior Probabilities

Suppose we know P(ω_j) and p(x|ω_j) for j = 1, 2, and we measure the lightness of a fish as the value x. Define P(ω_j|x) as the a posteriori probability (the probability of the state of nature being ω_j given the measured feature value x). We can use the Bayes formula to convert the prior probability to the posterior probability:

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}, \quad \text{where } p(x) = \sum_{j=1}^{2} p(x \mid \omega_j)\, P(\omega_j).$$

Making a Decision

p(x|ω_j) is called the likelihood and p(x) is called the evidence. How can we make a decision after observing the value of x? Decide ω_1 if P(ω_1|x) > P(ω_2|x), otherwise decide ω_2. Rewriting the rule gives

$$\text{Decide } \omega_1 \text{ if } \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{P(\omega_2)}{P(\omega_1)}, \text{ otherwise decide } \omega_2.$$

Note that, at every x, P(ω_1|x) + P(ω_2|x) = 1.
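As an illustration, here is a minimal sketch of this posterior-based decision rule in Python, assuming hypothetical univariate Gaussian class-conditionals and priors (the numbers are made up, not taken from the lecture):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities p(x|w1), p(x|w2) and priors (assumed values).
priors = np.array([2/3, 1/3])                 # P(w1), P(w2)
likelihoods = [norm(loc=2.0, scale=1.0),      # p(x|w1)
               norm(loc=4.0, scale=1.5)]      # p(x|w2)

def posteriors(x):
    """Return [P(w1|x), P(w2|x)] via the Bayes formula."""
    px_w = np.array([l.pdf(x) for l in likelihoods])   # likelihoods at x
    evidence = np.sum(px_w * priors)                   # p(x)
    return px_w * priors / evidence

x = 3.1
post = posteriors(x)
decision = 1 if post[0] > post[1] else 2      # decide w1 if P(w1|x) > P(w2|x)
print(post, "-> decide w%d" % decision)
```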

Probability of Error

What is the probability of error for this decision?

$$P(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\ P(\omega_2 \mid x) & \text{if we decide } \omega_1 \end{cases}$$

What is the average probability of error?

$$P(\text{error}) = \int p(\text{error}, x)\, dx = \int P(\text{error} \mid x)\, p(x)\, dx$$

The Bayes decision rule minimizes this error because P(error|x) = min{P(ω_1|x), P(ω_2|x)}.

Bayesian Decision Theory

How can we generalize to
- more than one feature? Replace the scalar x by the feature vector x.
- more than two states of nature? It is just a difference in notation.
- allowing actions other than just decisions? Allow the possibility of rejection.
- different risks in the decision? Define how costly each action is.

Bayesian Decision Theory

Let {ω_1, ..., ω_c} be the finite set of c states of nature (classes, categories). Let {α_1, ..., α_a} be the finite set of a possible actions. Let λ(α_i|ω_j) be the loss incurred for taking action α_i when the state of nature is ω_j. Let x be the d-component vector-valued random variable called the feature vector.

Bayesian Decision Theory

p(x|ω_j) is the class-conditional probability density function and P(ω_j) is the prior probability that nature is in state ω_j. The posterior probability can be computed as

$$P(\omega_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_j)\, P(\omega_j)}{p(\mathbf{x})}, \quad \text{where } p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j)\, P(\omega_j).$$

Conditional Risk

Suppose we observe x and take action α_i. If the true state of nature is ω_j, we incur the loss λ(α_i|ω_j). The expected loss of taking action α_i is

$$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x}),$$

which is also called the conditional risk.
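A small sketch of evaluating the conditional risks and picking the minimum-risk action, with a made-up loss matrix and made-up posteriors purely for illustration:

```python
import numpy as np

# Hypothetical loss matrix: loss[i, j] = lambda(alpha_i | w_j) (assumed values).
loss = np.array([[0.0, 2.0],    # action a1 = decide w1
                 [1.0, 0.0]])   # action a2 = decide w2

def min_risk_action(posteriors):
    """Return (best action index, risks) for the given posteriors P(w_j|x)."""
    risks = loss @ posteriors   # R(alpha_i|x) = sum_j loss[i, j] * P(w_j|x)
    return int(np.argmin(risks)), risks

post = np.array([0.3, 0.7])     # example posteriors at some x
best, risks = min_risk_action(post)
print("conditional risks:", risks, "-> take action", best + 1)
```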

Minimum-Risk Classification

The general decision rule α(x) tells us which action to take for every observation x. We want to find the decision rule that minimizes the overall risk

$$R = \int R(\alpha(\mathbf{x}) \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}.$$

The Bayes decision rule minimizes the overall risk by selecting, for each x, the action α_i for which R(α_i|x) is minimum. The resulting minimum overall risk is called the Bayes risk and is the best performance that can be achieved.

Two-Category Classification

Define α_1 as deciding ω_1, α_2 as deciding ω_2, and λ_ij = λ(α_i|ω_j). The conditional risks can be written as

$$R(\alpha_1 \mid \mathbf{x}) = \lambda_{11} P(\omega_1 \mid \mathbf{x}) + \lambda_{12} P(\omega_2 \mid \mathbf{x}),$$
$$R(\alpha_2 \mid \mathbf{x}) = \lambda_{21} P(\omega_1 \mid \mathbf{x}) + \lambda_{22} P(\omega_2 \mid \mathbf{x}).$$

Two-Category Classification

The minimum-risk decision rule becomes: decide ω_1 if

$$(\lambda_{21} - \lambda_{11})\, P(\omega_1 \mid \mathbf{x}) > (\lambda_{12} - \lambda_{22})\, P(\omega_2 \mid \mathbf{x}),$$

otherwise decide ω_2. This corresponds to deciding ω_1 if

$$\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > \frac{(\lambda_{12} - \lambda_{22})}{(\lambda_{21} - \lambda_{11})}\, \frac{P(\omega_2)}{P(\omega_1)},$$

i.e., comparing the likelihood ratio to a threshold that is independent of the observation x.

Minimum-Error-Rate Classification

Actions are decisions on classes (α_i is deciding ω_i). If action α_i is taken and the true state of nature is ω_j, then the decision is correct if i = j and in error if i ≠ j. We want to find a decision rule that minimizes the probability of error.

Minimum-Error-Rate Classification

Define the zero-one loss function

$$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & \text{if } i = j \\ 1 & \text{if } i \neq j \end{cases} \qquad i, j = 1, \ldots, c$$

(all errors are equally costly). The conditional risk becomes

$$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x}) = \sum_{j \neq i} P(\omega_j \mid \mathbf{x}) = 1 - P(\omega_i \mid \mathbf{x}).$$

Minimum-Error-Rate Classification

Minimizing the risk requires maximizing P(ω_i|x), and results in the minimum-error decision rule: decide ω_i if P(ω_i|x) > P(ω_j|x) for all j ≠ i. The resulting error is called the Bayes error and is the best performance that can be achieved.

Minimum-Error-Rate Classification

Figure 2: The likelihood ratio p(x|ω_1)/p(x|ω_2). The threshold θ_a is computed using the priors P(ω_1) = 2/3 and P(ω_2) = 1/3 and a zero-one loss function. If we penalize mistakes in classifying ω_2 patterns as ω_1 more than the converse, we should increase the threshold to θ_b.

Discriminant Functions

A useful way of representing classifiers is through discriminant functions g_i(x), i = 1, ..., c, where the classifier assigns a feature vector x to class ω_i if g_i(x) > g_j(x) for all j ≠ i. For the classifier that minimizes the conditional risk, g_i(x) = −R(α_i|x). For the classifier that minimizes the probability of error, g_i(x) = P(ω_i|x).

Discriminant Functions

These functions divide the feature space into c decision regions (R_1, ..., R_c) separated by decision boundaries. Note that the classification results do not change if we replace every g_i(x) by f(g_i(x)), where f(·) is a monotonically increasing function (e.g., the logarithm). This can lead to significant analytical and computational simplifications.

The Gaussian Density

The Gaussian can be considered a model in which the feature vectors for a given class are continuous-valued, randomly corrupted versions of a single typical or prototype vector. Some properties of the Gaussian:
- Analytically tractable.
- Completely specified by the 1st and 2nd moments.
- Has the maximum entropy of all distributions with a given mean and variance.
- Many processes are asymptotically Gaussian (Central Limit Theorem).
- Linear transformations of a Gaussian are also Gaussian.
- For jointly Gaussian variables, uncorrelatedness implies independence.

Univariate Gaussian

For x ∈ ℝ:

$$p(x) = N(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2 \right]$$

where

$$\mu = E[x] = \int x\, p(x)\, dx, \qquad \sigma^2 = E[(x - \mu)^2] = \int (x - \mu)^2\, p(x)\, dx.$$
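A quick numerical sketch, using an assumed μ and σ (not values from the lecture), of evaluating this density and checking the roughly-95%-within-2σ property mentioned on the next slide:

```python
from scipy.stats import norm

mu, sigma = 0.0, 1.5          # assumed parameters for illustration
g = norm(loc=mu, scale=sigma)

# Density at the mean and the probability mass within mu +/- 2*sigma.
print("p(mu) =", g.pdf(mu))   # peak value 1/(sqrt(2*pi)*sigma)
print("P(|x - mu| <= 2 sigma) =", g.cdf(mu + 2*sigma) - g.cdf(mu - 2*sigma))  # ~0.954
```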

Univariate Gaussian

Figure 3: A univariate Gaussian distribution has roughly 95% of its area in the range |x − µ| ≤ 2σ.

Multivariate Gaussian

For x ∈ ℝ^d:

$$p(\mathbf{x}) = N(\boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right]$$

where

$$\boldsymbol{\mu} = E[\mathbf{x}] = \int \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x}, \qquad \Sigma = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] = \int (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\, p(\mathbf{x})\, d\mathbf{x}.$$
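A minimal sketch, with an assumed mean and covariance, of evaluating this density directly from the formula and checking it against scipy's implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, 2.0])                    # assumed mean
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])               # assumed covariance

def gaussian_pdf(x, mu, Sigma):
    """Evaluate N(mu, Sigma) at x using the formula above."""
    d = mu.size
    diff = x - mu
    maha2 = diff @ np.linalg.solve(Sigma, diff)          # squared Mahalanobis distance
    norm_const = (2*np.pi)**(d/2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha2) / norm_const

x = np.array([0.5, 2.5])
print(gaussian_pdf(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))    # should match
```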

Multivariate Gaussian

Figure 4: Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean µ. The loci of points of constant density are the ellipses for which (x − µ)^T Σ^{-1} (x − µ) is constant, where the eigenvectors of Σ determine the direction and the corresponding eigenvalues determine the length of the principal axes. The quantity r² = (x − µ)^T Σ^{-1} (x − µ) is called the squared Mahalanobis distance from x to µ.

Linear Transformations

Recall that, for x ∈ ℝ^d, A ∈ ℝ^{d×k}, and y = A^T x ∈ ℝ^k: if x ∼ N(µ, Σ), then y ∼ N(A^T µ, A^T Σ A). As a special case, the whitening transform

$$A_w = \Phi \Lambda^{-1/2},$$

where Φ is the matrix whose columns are the orthonormal eigenvectors of Σ and Λ is the diagonal matrix of the corresponding eigenvalues, gives a transformed covariance matrix equal to the identity matrix I.
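A short sketch of the whitening transform for an assumed covariance matrix, verifying that the transformed covariance is the identity:

```python
import numpy as np

Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])                 # assumed covariance of x

# Whitening transform A_w = Phi * Lambda^{-1/2}.
eigvals, Phi = np.linalg.eigh(Sigma)           # columns of Phi are orthonormal eigenvectors
A_w = Phi @ np.diag(eigvals ** -0.5)

# Covariance of y = A_w^T x is A_w^T Sigma A_w, which should be the identity.
print(np.round(A_w.T @ Sigma @ A_w, decimals=10))
```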

Discriminant Functions for the Gaussian Density

Discriminant functions for minimum-error-rate classification can be written as

$$g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i).$$

For p(x|ω_i) = N(µ_i, Σ_i),

$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i).$$
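A minimal sketch of a Gaussian minimum-error-rate classifier built directly from this log-discriminant, with made-up class means, covariances, and priors:

```python
import numpy as np

def log_discriminant(x, mu, Sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - d/2 ln 2pi - 1/2 ln|Sigma| + ln P(w_i)."""
    d = mu.size
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Hypothetical two-class problem (assumed parameters).
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 2/3),   # class w1
    (np.array([3.0, 3.0]), np.array([[2.0, 0.8], [0.8, 1.5]]), 1/3),   # class w2
]

x = np.array([1.5, 1.0])
scores = [log_discriminant(x, mu, S, p) for mu, S, p in params]
print("discriminants:", scores, "-> decide w%d" % (int(np.argmax(scores)) + 1))
```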

Case 1: Σ_i = σ²I

Discriminant functions are

$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0} \quad \text{(linear discriminant)}$$

where

$$\mathbf{w}_i = \frac{1}{\sigma^2} \boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2\sigma^2} \boldsymbol{\mu}_i^T \boldsymbol{\mu}_i + \ln P(\omega_i)$$

(w_{i0} is the threshold or bias for the i-th category).

Case 1: Σ_i = σ²I

Decision boundaries are the hyperplanes g_i(x) = g_j(x), and can be written as w^T(x − x_0) = 0 where

$$\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j, \qquad \mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^2}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2} \ln \frac{P(\omega_i)}{P(\omega_j)}\, (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j).$$

The hyperplane separating R_i and R_j passes through the point x_0 and is orthogonal to the vector w.
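A short numerical check, with assumed means, variance, and priors, that the point x_0 computed from these formulas indeed lies on the decision boundary:

```python
import numpy as np

# Hypothetical spherical-covariance case (assumed values).
sigma2 = 1.0
mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 1.0])
P_i, P_j = 0.6, 0.4

def g(x, mu, prior):
    """Linear discriminant for Sigma_i = sigma^2 I."""
    return (mu / sigma2) @ x - (mu @ mu) / (2 * sigma2) + np.log(prior)

# Hyperplane parameters from the slide.
w = mu_i - mu_j
x0 = 0.5 * (mu_i + mu_j) - sigma2 / np.sum(w**2) * np.log(P_i / P_j) * w

print(g(x0, mu_i, P_i), g(x0, mu_j, P_j))   # equal: x0 lies on the decision boundary
```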

Case 1: Σ_i = σ²I

Figure 5: If the covariance matrices of two distributions are equal and proportional to the identity matrix, then the distributions are spherical in d dimensions, and the boundary is a generalized hyperplane of d − 1 dimensions, perpendicular to the line separating the means. The decision boundary shifts as the priors are changed.

Case 1: Σ_i = σ²I

A special case arises when the priors P(ω_i) are the same for i = 1, ..., c: the classifier reduces to the minimum-distance classifier with the decision rule

$$\text{assign } \mathbf{x} \text{ to } \omega_{i^*} \text{ where } i^* = \arg\min_{i=1,\ldots,c} \|\mathbf{x} - \boldsymbol{\mu}_i\|.$$

Case 2: Σ_i = Σ

Discriminant functions are

$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0} \quad \text{(linear discriminant)}$$

where

$$\mathbf{w}_i = \Sigma^{-1} \boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^T \Sigma^{-1} \boldsymbol{\mu}_i + \ln P(\omega_i).$$

Case 2: Σ_i = Σ

Decision boundaries can be written as w^T(x − x_0) = 0 where

$$\mathbf{w} = \Sigma^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j), \qquad \mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\ln\left(P(\omega_i)/P(\omega_j)\right)}{(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T \Sigma^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)}\, (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j).$$

The hyperplane passes through x_0 but is not necessarily orthogonal to the line between the means.
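A similar check for the shared-covariance case, again with assumed parameters; note that the boundary normal w is generally not parallel to µ_i − µ_j:

```python
import numpy as np

# Hypothetical shared-covariance case (assumed values).
Sigma = np.array([[2.0, 0.7],
                  [0.7, 1.0]])
mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.5, 1.0])
P_i, P_j = 0.7, 0.3

def g(x, mu, prior):
    """Linear discriminant for a shared covariance Sigma."""
    w = np.linalg.solve(Sigma, mu)              # Sigma^{-1} mu
    return w @ x - 0.5 * mu @ w + np.log(prior)

diff = mu_i - mu_j
w = np.linalg.solve(Sigma, diff)                # normal vector of the boundary
x0 = 0.5 * (mu_i + mu_j) - np.log(P_i / P_j) / (diff @ w) * diff

print(g(x0, mu_i, P_i), g(x0, mu_j, P_j))       # equal: x0 is on the boundary
print(w, diff)                                  # w is not parallel to mu_i - mu_j
```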

Case 2: Σ_i = Σ

Figure 6: Probability densities with equal but asymmetric Gaussian distributions. The decision hyperplanes are not necessarily perpendicular to the line connecting the means.

Case 3: Σ_i arbitrary

Discriminant functions are

$$g_i(\mathbf{x}) = \mathbf{x}^T W_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0} \quad \text{(quadratic discriminant)}$$

where

$$W_i = -\frac{1}{2} \Sigma_i^{-1}, \qquad \mathbf{w}_i = \Sigma_i^{-1} \boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^T \Sigma_i^{-1} \boldsymbol{\mu}_i - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i).$$

Decision boundaries are hyperquadrics.
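A small sketch, for one class with assumed parameters, checking that this expanded quadratic form agrees with the log-density-plus-log-prior discriminant up to the class-independent constant −(d/2) ln 2π:

```python
import numpy as np

# Hypothetical class parameters (assumed values).
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.5, 0.4],
                  [0.4, 0.8]])
prior = 0.3

Sigma_inv = np.linalg.inv(Sigma)
W = -0.5 * Sigma_inv                                  # W_i
w = Sigma_inv @ mu                                    # w_i
w0 = -0.5 * mu @ Sigma_inv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)

def g_quadratic(x):
    return x @ W @ x + w @ x + w0

def g_direct(x):
    # Full ln p(x|w_i) + ln P(w_i); differs from g_quadratic only by the
    # class-independent constant -(d/2) ln 2*pi, which does not affect decisions.
    d = mu.size
    diff = x - mu
    return (-0.5 * diff @ Sigma_inv @ diff - 0.5 * d * np.log(2*np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior))

x = np.array([0.2, 0.5])
print(g_quadratic(x), g_direct(x) + mu.size/2 * np.log(2*np.pi))   # should match
```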

Case 3: Σ_i arbitrary

Figure 7: Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics.

Case 3: Σ_i arbitrary

Figure 8: Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics.

Error Probabilities and Integrals

For the two-category case,

$$\begin{aligned} P(\text{error}) &= P(\mathbf{x} \in R_2, \omega_1) + P(\mathbf{x} \in R_1, \omega_2) \\ &= P(\mathbf{x} \in R_2 \mid \omega_1) P(\omega_1) + P(\mathbf{x} \in R_1 \mid \omega_2) P(\omega_2) \\ &= \int_{R_2} p(\mathbf{x} \mid \omega_1)\, P(\omega_1)\, d\mathbf{x} + \int_{R_1} p(\mathbf{x} \mid \omega_2)\, P(\omega_2)\, d\mathbf{x}. \end{aligned}$$
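A rough numerical sketch of these integrals for a hypothetical one-dimensional problem (assumed Gaussians and priors), using the Bayes decision regions and a simple grid approximation of the two integrals:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical one-dimensional two-class problem (assumed parameters).
priors = [0.5, 0.5]
pdfs = [norm(loc=0.0, scale=1.0).pdf,     # p(x|w1)
        norm(loc=2.0, scale=1.0).pdf]     # p(x|w2)

def decide(x):
    """Bayes (minimum-error) decision: index of the larger p(x|w_i) P(w_i)."""
    return 0 if pdfs[0](x) * priors[0] > pdfs[1](x) * priors[1] else 1

# P(error) = int_{R2} p(x|w1) P(w1) dx + int_{R1} p(x|w2) P(w2) dx,
# where R_i is the region in which we decide w_i.
xs = np.linspace(-10.0, 12.0, 20001)
dx = xs[1] - xs[0]
decisions = np.array([decide(x) for x in xs])
err1 = np.sum(pdfs[0](xs) * priors[0] * (decisions == 1)) * dx   # w1 mass falling in R2
err2 = np.sum(pdfs[1](xs) * priors[1] * (decisions == 0)) * dx   # w2 mass falling in R1
print("P(error) ~", err1 + err2)          # about 0.159 for these parameters
```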

Error Probabilities and Integrals

For the multicategory case,

$$\begin{aligned} P(\text{error}) &= 1 - P(\text{correct}) \\ &= 1 - \sum_{i=1}^{c} P(\mathbf{x} \in R_i, \omega_i) \\ &= 1 - \sum_{i=1}^{c} P(\mathbf{x} \in R_i \mid \omega_i) P(\omega_i) \\ &= 1 - \sum_{i=1}^{c} \int_{R_i} p(\mathbf{x} \mid \omega_i)\, P(\omega_i)\, d\mathbf{x}. \end{aligned}$$

Error Probabilities and Integrals

Figure 9: Components of the probability of error for equal priors and a non-optimal decision point x*. The optimal point x_B minimizes the total shaded area and gives the Bayes error rate.

Receiver Operating Characteristics

Consider the two-category case and define ω_1: target is present, ω_2: target is not present.

Table 1: Confusion matrix.

              Assigned ω_1         Assigned ω_2
True ω_1      correct detection    mis-detection
True ω_2      false alarm          correct rejection

Mis-detection is also called false negative or Type II error. False alarm is also called false positive or Type I error.

Receiver Operating Characteristics

If we use a parameter (e.g., a threshold) in our decision, the plot of these rates for different values of the parameter is called the receiver operating characteristic (ROC) curve.

Figure 10: Example receiver operating characteristic (ROC) curves for different settings of the system.
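A minimal sketch of tracing out an ROC curve by sweeping a decision threshold, for a hypothetical detection problem with assumed Gaussian observation distributions:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical detection problem (assumed distributions): the observation is
# N(0,1) when the target is absent (w2) and N(2,1) when it is present (w1).
present, absent = norm(loc=2.0, scale=1.0), norm(loc=0.0, scale=1.0)

thresholds = np.linspace(-4.0, 6.0, 101)
p_detection = present.sf(thresholds)      # P(x > t | w1): correct detection rate
p_false_alarm = absent.sf(thresholds)     # P(x > t | w2): false alarm rate

# Each threshold gives one (false alarm, detection) point; together they trace the ROC curve.
for t in (0.0, 1.0, 2.0):
    i = int(np.argmin(np.abs(thresholds - t)))
    print(f"threshold {t:+.1f}: P_FA = {p_false_alarm[i]:.3f}, P_D = {p_detection[i]:.3f}")
```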

Summary

To minimize the overall risk, choose the action that minimizes the conditional risk R(α|x). To minimize the probability of error, choose the class that maximizes the posterior probability P(ω_j|x). If there are different penalties for misclassifying patterns from different classes, the posteriors must be weighted according to these penalties before taking the action. Do not forget that these decisions are optimal only under the assumption that the true values of the probabilities are known.