Part 2 Elements of Bayesian Decision Theory

1 Part 2: Elements of Bayesian Decision Theory (Machine Learning, March 2017, Fabio Roli)

2 Introduction
- Statistical pattern classification is grounded in Bayesian decision theory; knowing the elements of this theory is therefore a must for anybody who wants to work in this field.
- Bayesian theory assumes that decision-making problems are formulated in probabilistic terms. The probabilities are either known or must be estimated.
- Bayes decision theory takes into account both the probability and the risk of decisions. Making a rational decision means weighing both the probability and the risk (or the utility) associated with the decision.
- We start our presentation of the elements of Bayesian decision theory by assuming that all the probabilities involved in the problem are known.

3 First thing to know: the MAP decision rule
- We present the fundamental decision rule of Bayesian decision theory using the salmon/sea bass classification example introduced in Part 1. Assume that an image segmentation module has already extracted the shape of each fish, as shown in the figure, and that a feature extraction module has characterized each shape/pattern with one feature: the average lightness of the shape.
- Decision problem: we want to assign each shape/pattern to one of the two classes considered (salmon, sea bass).

4 First thing to know: the MAP decision rule
- We cannot know deterministically the class (salmon or sea bass) of the next fish arriving on the conveyor belt, so the problem must be formulated in probabilistic terms: the next fish is a salmon or a sea bass with a given probability.
- Bayes decision theory formalizes this situation with the concept of state of nature (usually called class in pattern recognition). In our example we have two states of nature/classes, ω1 and ω2. Let ω be the random variable that identifies the class, taking the value ω1 or ω2.
- The two classes could have the same prior probability, P(ω1) = P(ω2), with P(ω1) + P(ω2) = 1 (we have just two species of fish).

5 The MAP decision rule
- If we had to make a decision without seeing the incoming fish, the only rational decision would be: assign the fish to ω1 if P(ω1) > P(ω2), otherwise assign it to ω2. This blind (a priori) decision works well only if one class is much more likely, e.g., P(ω1) >> P(ω2) (and, as we will see later, the two decisions have the same risk).
- In general, to make a rational decision according to Bayesian theory we must observe the pattern: we must see the fish and characterize it with some features, for example the average lightness of the pattern. Since the fish arriving on the belt have random lightness values, the lightness feature x must be treated as a random variable with class-conditional distribution p(x|ωi).

6 An example of a one-dimensional p(x|ωi)
- p(x|ωi) is the class-conditional probability density function. If x is the average lightness of the image region associated with a fish of class ωi, then the difference between the functions p(x|ω1) and p(x|ω2) characterizes the expected lightness difference between the two fish species.

7 Bayes decision rule
- Assume we know the two priors P(ωj) and the two class-conditional density functions p(x|ωj), j = 1, 2. If we measure the average lightness x of the incoming fish, then, taking into account the probabilistic nature of the problem, the most rational decision rule is based on the joint probability:
P(ωj, x) = P(ωj|x) p(x) = p(x|ωj) P(ωj)
which we can rewrite as Bayes' rule, the basis of the MAP (maximum a posteriori) decision rule:
P(ωj|x) = p(x|ωj) P(ωj) / p(x)
Posterior = (Likelihood × Prior) / Evidence
- Note that the evidence is p(x) = Σ_{j=1}^{2} p(x|ωj) P(ωj).
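As a concrete illustration, the posterior computation can be sketched in a few lines of Python; the Gaussian class-conditional densities and the priors used below are assumptions chosen only for this example, not values from the slides.

```python
import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities p(x|omega_j): lightness modeled as Gaussian
# for each fish species (illustrative parameters only).
likelihoods = [norm(loc=4.0, scale=1.0),   # p(x|omega_1), e.g. salmon
               norm(loc=7.0, scale=1.5)]   # p(x|omega_2), e.g. sea bass
priors = np.array([2/3, 1/3])              # P(omega_1), P(omega_2)

def posteriors(x):
    """Return [P(omega_1|x), P(omega_2|x)] via Bayes' rule."""
    joint = np.array([lik.pdf(x) * p for lik, p in zip(likelihoods, priors)])
    evidence = joint.sum()                  # p(x) = sum_j p(x|omega_j) P(omega_j)
    return joint / evidence

x = 5.0                                     # measured average lightness
post = posteriors(x)
print(post, "-> decide omega_%d" % (np.argmax(post) + 1))
```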

8 An example of one-dimensional P(ωi|x)
- P(ωi|x) computed with P(ω1) = 2/3 and P(ω2) = 1/3.

9 The MAP decision rule
- The MAP (maximum a posteriori probability) criterion is the most rational decision rule for the probabilistic setting considered:
if P(ω1|x) > P(ω2|x), assign x to ω1; if P(ω1|x) < P(ω2|x), assign x to ω2.
- This rule is the most rational one because it minimizes the error probability for any given x:
P(error|x) = P(ω1|x) if we assign x to ω2; P(error|x) = P(ω2|x) if we assign x to ω1.
- We can prove that the MAP rule also minimizes the average error:
P(error) = ∫_{-∞}^{+∞} P(error, x) dx = ∫_{-∞}^{+∞} P(error|x) p(x) dx

10 Likelihood ratio test and ML rule
- We can reformulate the MAP rule as follows: if p(x|ω1) P(ω1) > p(x|ω2) P(ω2), assign x to ω1, else assign x to ω2. Note: the evidence p(x) does not matter!
- Likelihood ratio test: we compare the likelihood ratio l(x) = p(x|ω1) / p(x|ω2) against a threshold θ = P(ω2) / P(ω1) that does not depend on x:
l(x) > θ ⇒ decide ω1,  l(x) < θ ⇒ decide ω2.
Note: the likelihood ratio is compared with the ratio between the priors.
- Two special cases: if p(x|ω1) = p(x|ω2), the decision depends only on the priors; if P(ω1) = P(ω2), the decision depends only on the likelihoods (ML, maximum likelihood, decision rule).
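A minimal sketch of the likelihood ratio test, reusing the illustrative Gaussian densities assumed in the previous example:

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities and priors (assumptions, as before).
p1, p2 = norm(4.0, 1.0), norm(7.0, 1.5)
P1, P2 = 2/3, 1/3

def decide(x):
    """MAP decision via the likelihood ratio test l(x) = p(x|w1)/p(x|w2) vs theta = P2/P1."""
    l = p1.pdf(x) / p2.pdf(x)
    theta = P2 / P1
    return "omega_1" if l > theta else "omega_2"

# With equal priors (theta = 1) the same test reduces to the ML rule.
print(decide(5.0), decide(6.5))
```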

11 The concept of decision regions
- The likelihood ratio test is defined by l(x) and the threshold θ. The test identifies two decision regions R1 and R2 in the feature space R (here we consider a single feature x):
R1 = {x ∈ R : l(x) > θ} and R2 = {x ∈ R : l(x) < θ} (if l(x) = θ, x can be assigned randomly to R1 or R2).
- Given the probability density functions p(x|ω1) and p(x|ω2), the regions R1 = R1(θ) and R2 = R2(θ) are determined by the threshold θ; in the Gaussian example of the figure, the boundary is the point x = x(θ).

12 MAP decision rule with more than two classes
- The MAP decision rule with more than two classes is:
x → ωi if P(ωi|x) > P(ωj|x) for all j ≠ i, i = 1, ..., c
- The likelihood ratio test is defined accordingly. It is easy to see that multiple thresholds θ_st arise, one for each pair of classes (s, t) whose posteriors P(ωs|x) and P(ωt|x) are the two largest in some region of the feature space.
- In the example of the figure (the curves P(ωi|x)), we have three classes and two thresholds, θ_12 and θ_23.

13 Basic concepts of error probability
- For the two-class case:
P(error) = P{x ∈ R2, ω1} + P{x ∈ R1, ω2}
         = P(ω1) P{x ∈ R2 | ω1} + P(ω2) P{x ∈ R1 | ω2}
         = P(ω1) ∫_{R2} p(x|ω1) dx + P(ω2) ∫_{R1} p(x|ω2) dx
- The optimal threshold x = xB is the Bayesian threshold providing the minimum error, called the Bayes error. In the figure, x = x* is a suboptimal threshold that introduces an added (reducible) error over the Bayes error. In practical cases we usually have an added error, because the optimal threshold providing the Bayes error is nearly impossible to estimate.
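The two integrals above are easy to evaluate numerically for one-dimensional densities. The sketch below, which again assumes illustrative Gaussian class-conditionals, compares the Bayes error at the optimal threshold with the added error of a suboptimal one.

```python
import numpy as np
from scipy.stats import norm

# Illustrative densities and priors (assumptions); class 1 lies to the left.
p1, p2 = norm(4.0, 1.0), norm(7.0, 1.5)
P1, P2 = 2/3, 1/3

x = np.linspace(-5, 20, 20001)
dx = x[1] - x[0]
post_unnorm_1 = P1 * p1.pdf(x)
post_unnorm_2 = P2 * p2.pdf(x)

def error_for_threshold(t):
    """P(error) when R1 = {x < t} (decide omega_1) and R2 = {x >= t} (decide omega_2)."""
    r2 = x >= t
    r1 = ~r2
    return post_unnorm_1[r2].sum() * dx + post_unnorm_2[r1].sum() * dx

# Bayes threshold: the point where P1*p(x|w1) = P2*p(x|w2) between the two means.
mask = (x > 4.0) & (x < 7.0)
x_B = x[mask][np.argmin(np.abs(post_unnorm_1[mask] - post_unnorm_2[mask]))]

print("Bayes error      :", error_for_threshold(x_B))
print("error at x*=6.5  :", error_for_threshold(6.5), "(added, reducible error)")
```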

14 Basic concepts of error probability
- With more than two classes (c > 2), it is more convenient to compute the error probability through the probability of correct classification:
P(correct) = Σ_{i=1}^{c} P{x ∈ Ri, ωi} = Σ_{i=1}^{c} P(ωi) P{x ∈ Ri | ωi} = Σ_{i=1}^{c} P(ωi) ∫_{Ri} p(x|ωi) dx
P(error) = 1 - P(correct)
- In general this computation can be very difficult, as it requires multidimensional integrals over density functions with complicated analytical forms. The computation is easy only for Gaussian densities (try it and see!).

15 MAP rule for error probability minimization
- Let us show that the MAP rule, x → ωi if P(ωi|x) > P(ωj|x) for all j ≠ i (i = 1, ..., c), minimizes the error probability.
- The error probability can be written as:
P(error) = Σ_{i=1}^{c} P(error|ωi) P(ωi), with P(error|ωi) = ∫_{C[Ri]} p(x|ωi) dx
where C[Ri] = ∪_{j=1, j≠i}^{c} Rj is the union of all decision regions other than Ri.

16 MAP rule for error probability minimization
- Therefore, the error probability can be rewritten as:
P(error) = Σ_{i=1}^{c} ∫_{C[Ri]} p(x|ωi) P(ωi) dx = Σ_{i=1}^{c} P(ωi) [1 - ∫_{Ri} p(x|ωi) dx] = 1 - Σ_{i=1}^{c} P(ωi) ∫_{Ri} p(x|ωi) dx
- Minimizing the error probability is therefore equivalent to maximizing the term related to the probability of correct classification, Σ_{i=1}^{c} P(ωi) ∫_{Ri} p(x|ωi) dx.
- This implies that each decision region Ri should contain the points x for which p(x|ωi) P(ωi), and hence P(ωi|x), is maximum; this proves that the MAP decision rule minimizes the error probability.

17 A quick note on error-probability upper bounds
- As the exact computation of the error probability is often nearly impossible, some upper bounds on the error probability have been proposed: the Chernoff bound and the Bhattacharyya bound. Students interested in further details are referred to Section 2.8 of Pattern Classification, by R. O. Duda, P. E. Hart, and D. G. Stork, John Wiley & Sons, 2000.
- However, these bounds were designed for Gaussian density functions and are not reliable for non-Gaussian ones. They are often loose bounds, useful in practice only if the error value provided by the bound is itself acceptable (knowing that the error is below k% must be enough for your practical purposes!).
- We will see later experimental techniques to assess the error probability of a pattern classifier (Part 7).
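As an illustration of how such a bound is used in the Gaussian case, the sketch below computes the Bhattacharyya bound P(error) ≤ sqrt(P1 P2) exp(-k(1/2)) for two multivariate Gaussian classes, following the formula given in Duda, Hart and Stork, Sec. 2.8; the means, covariances and priors are illustrative assumptions.

```python
import numpy as np

# Illustrative parameters of two Gaussian classes (assumptions).
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S1 = np.array([[1.0, 0.2], [0.2, 1.0]])
S2 = np.array([[1.5, 0.0], [0.0, 0.5]])
P1, P2 = 0.5, 0.5

def bhattacharyya_bound(mu1, S1, mu2, S2, P1, P2):
    """Bhattacharyya bound on P(error) for two Gaussian classes."""
    Sm = 0.5 * (S1 + S2)
    dmu = mu2 - mu1
    # k(1/2) = 1/8 (mu2-mu1)' [(S1+S2)/2]^-1 (mu2-mu1) + 1/2 ln(|(S1+S2)/2| / sqrt(|S1||S2|))
    k = (0.125 * dmu @ np.linalg.solve(Sm, dmu)
         + 0.5 * np.log(np.linalg.det(Sm) / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))
    return np.sqrt(P1 * P2) * np.exp(-k)

print("P(error) <=", bhattacharyya_bound(mu1, S1, mu2, S2, P1, P2))
```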

18 Bayesian decision theory
Now we generalize the standard setting by:
1. Allowing the use of more than one feature (feature space): x = (x1, x2, ..., xd), a feature vector with d elements.
2. Allowing more than two classes.
3. Introducing the concept of risk as a generalization of the concept of error probability.
4. Allowing the reject option, that is, making no decision when deciding is too risky/costly and the decision can be postponed, possibly asking for human decision/intervention.

19 From error to risk
- The following expression of the conditional error probability assumes that all errors are equal, i.e., all the error terms (j ≠ i) have the same cost, equal to 1:
P(error | x, decide ωi) = Σ_{j=1, j≠i}^{c} P(ωj|x) = 1 - P(ωi|x)
- For some applications these costs need to be different. If the costs differ, the above quantity is no longer an error probability; the notion of risk function is then defined:
R(ωi|x) = Σ_{j=1}^{c} w_ij P(ωj|x)
- The weights w_ij are the costs of the errors. Later we will denote them as λ(αi|ωj). Note that w_ii can be negative (a gain).

20 Minimum risk theory
- The MAP rule does not consider the different costs associated with the different errors. In some applications this is not acceptable, because errors can cause different losses and should therefore have different costs.
- Minimum risk theory (also called utility theory in economics) takes into account both the probabilities and the costs of actions/decisions.
- Problem formulation: classes Ω = {ω1, ω2, ..., ωc}; actions/decisions A = {α1, α2, ..., αa}.
- In most cases we consider action = classification, that is, the action is the decision about the class of the pattern.

21 Minimum risk theory
- The costs (losses) associated with the different actions, given the possible true classes, are defined by the loss matrix Λ:
Λ = [ λ(α1|ω1) λ(α1|ω2) ... λ(α1|ωc)
      λ(α2|ω1) λ(α2|ω2) ... λ(α2|ωc)
      ...
      λ(αa|ω1) λ(αa|ω2) ... λ(αa|ωc) ]
- λ(αi|ωj) is the loss function denoting the loss/cost of taking action/decision αi when the true class is ωj.
- An example of loss matrix for intrusion detection in computer networks: Ω = {ω1 = malicious traffic, ω2 = normal traffic}; A = {α1 = server off, α2 = server on};
Λ = [ 0    λ12
      λ21  0   ]
For a bank computer network: λ12 << λ21.
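A tiny sketch of how such a loss matrix drives a minimum-risk decision; the posterior values and the cost figures below are made-up numbers for illustration only.

```python
import numpy as np

# Loss matrix Lambda[i, j] = cost of action alpha_i when the true class is omega_j.
# Classes: omega_1 = malicious traffic, omega_2 = normal traffic.
# Actions: alpha_1 = server off, alpha_2 = server on.
lam_12, lam_21 = 1.0, 50.0          # illustrative: missing an attack costs far more
Lambda = np.array([[0.0, lam_12],
                   [lam_21, 0.0]])

post = np.array([0.1, 0.9])         # assumed P(malicious|x), P(normal|x)

cond_risk = Lambda @ post           # R(alpha_i|x) = sum_j lambda(alpha_i|omega_j) P(omega_j|x)
action = np.argmin(cond_risk)       # minimum-risk action
print(cond_risk, "-> take", ["server off", "server on"][action])
```

Note how the large λ21 makes "server off" the minimum-risk action even though the posterior probability of an attack is only 0.1.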

22 An example of loss matrix for intrusion detection systems
- A 5 × 5 loss matrix over the traffic classes Normal Traffic, User to Root Attack, Remote to Local Attack, Probing Attack, and Denial of Service Attack, assigning a different cost to each (decision, true class) pair.

23 Minimum risk decision rule
- Assume that action αi is a candidate for execution after the pattern x has been observed. We do not know the true class of x, but assume we know the posteriors P(ωj|x). We can then evaluate the conditional risk associated with action αi:
R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) P(ωj|x) = E_{ω∈Ω}{λ(αi|ω) | x}
- The conditional risk can be regarded as an average loss/cost.
- Minimum risk decision rule: x → αi if R(αi|x) < R(αj|x) for all j ≠ i, i = 1, ..., a. Given the pattern x, we choose the action αi with minimum risk. This is the optimal decision rule for any pattern x.

24 Minimum risk for binary classification
- Consider a two-class problem and the special case where action = classification, so that αi corresponds to assigning the pattern to class ωi. Let λij = λ(ωi|ωj) be the loss incurred by assigning the pattern to class ωi when the true class is ωj.
- The conditional risks can be written as:
R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x)
R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x)
- The minimum risk decision rule is: x → ω1 if R(ω1|x) < R(ω2|x), else x → ω2.

25 Minimum risk for binary classification
- In terms of posterior probabilities: x → ω1 if (λ21 - λ11) P(ω1|x) > (λ12 - λ22) P(ω2|x).
- Using Bayes' rule: x → ω1 if (λ21 - λ11) p(x|ω1) P(ω1) > (λ12 - λ22) p(x|ω2) P(ω2).
- It is reasonable to assume λ21 > λ11. Making the likelihood ratio p(x|ω1)/p(x|ω2) explicit, the rule can be rewritten as: decide ω1 if the likelihood ratio exceeds a threshold θ that does not depend on x:
x → ω1 if l(x) = p(x|ω1) / p(x|ω2) > [(λ12 - λ22) / (λ21 - λ11)] · [P(ω2) / P(ω1)] = θ

26 Minimum error and the 0-1 loss matrix
- The action αi corresponds to assigning the pattern x to the class ωi. In some cases a simple loss function is appropriate:
λ(αi|ωj) = 0 if i = j, 1 if i ≠ j,   i, j = 1, ..., c
This is the 0-1 (zero-one) loss function: all errors have the same cost, equal to 1.
- With this loss, the conditional risk is exactly the conditional error probability:
R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) P(ωj|x) = Σ_{j≠i} P(ωj|x) = 1 - P(ωi|x)

27 Minimum error classification
- Using the 0-1 loss function, the minimum risk decision rule becomes the classical MAP (maximum a posteriori probability) rule: assign x to ωi if P(ωi|x) > P(ωj|x) for all j ≠ i.
- Rewriting in terms of the likelihood ratio, with θλ = [(λ12 - λ22)/(λ21 - λ11)] · P(ω2)/P(ω1): x → ω1 if p(x|ω1)/p(x|ω2) > θλ.
- Examples: with the 0-1 loss matrix Λ = [[0, 1], [1, 0]], θλ = P(ω2)/P(ω1) = θa; with, e.g., Λ = [[0, 2], [1, 0]] (λ12 = 2, λ21 = 1), θλ = 2 P(ω2)/P(ω1) = θb.

28 Minimum error classification and decision regions
- With zero diagonal losses the threshold is θλ = (λ12/λ21) · P(ω2)/P(ω1). If deciding ω1 when the true class is ω2 is more costly (λ12 > λ21), the threshold becomes tighter (higher) and the decision region R1 becomes smaller: the likelihood of ω1 must be higher before we decide ω1.
- In the figure, we obtain the threshold θb when P(ω1) = P(ω2) and λ12 = 2, λ21 = 1. The region R1 shrinks when λ12 > λ21.

29 A remark on the likelihood ratio test and decision regions
- The likelihood ratio test transforms the decision problem in a d-dimensional feature space into a one-dimensional test against the threshold θ, without any need to know the decision regions exactly and explicitly.
- The decision regions could be very complex manifolds, but we do not need to compute them to classify the pattern x: it is sufficient to compute the likelihood ratio l(x) and compare it with the threshold θ.

30 Some remarks on the use of the minimum risk decision rule in security problems
- Ω = {ω1 = malicious traffic, ω2 = normal traffic}; A = {α1 = server off, α2 = server on}; loss matrix:
Λ = [ 0    λ12
      λ21  0   ]
- Minimum risk decision rule: server off if l(x) = p(x|attack) / p(x|normal) > (λ12/λ21) · P(normal)/P(attack) = θ. We assume λ12 << λ21.
- How should we set the costs? How should we estimate the priors?

31 Some remarks on the use of the minimum risk decision rule in security problems
- How should we set the costs? How should we estimate the priors?
- Rewrite the decision rule with P* = P(attack), P(normal) = 1 - P*, and λ* = λ12/λ21:
l(x) = p(x|attack) / p(x|normal) > λ* (1 - P*) / P* = θ
- In this form the relationship between priors and costs/losses becomes much clearer: an estimate of P* helps in choosing the costs, since the value of P* directly constrains how small λ12 must be with respect to λ21 (λ12 << λ21) if we want the classifier to actually detect attacks.
- What if P* cannot be estimated reliably? We will see the minimax rule later.

32 Overall risk minimization
- The action taken depends on the pattern x through the decision function α(x), and the overall risk R can be written as:
R = ∫ R(α(x)|x) p(x) dx = Σ_{i=1}^{a} ∫_{Ri} Σ_{j=1}^{c} λ(αi|ωj) p(x|ωj) P(ωj) dx
- We can show that the minimum risk rule, applied to every pattern x, minimizes the overall risk over the entire feature space.
- Along the way we introduce some notable concepts (false alarm and miss alarm probability/rate) and formulate a two-class problem as a hypothesis testing problem.
- To analyse overall risk minimization, let us consider the personal identity verification problem based on fingerprint recognition.

33 Fingerprint recognition as hypothesis testing
- A person claims an identity ("I'm Fabio Roli!"); the system performs a two-class/hypothesis test: genuine vs impostor.
- We use d features, called minutiae, extracted from the fingerprint image. The minutiae points of two fingerprints (the input and the template stored in the fingerprint database) are compared (minutiae verification), producing a verification score.
- The decision is based on the score, which takes real values between 0 and 1.

34 Decision regions
- The figure shows an example of the sample distributions of the scores of genuine and impostor users; the decision regions R1 and R2 are defined according to these distributions.
- Let us assume that p(s|genuine) and p(s|impostor) are known, and that the space S of score values has been subdivided into regions R1 and R2 so that: if s belongs to R2, the claimed identity is verified (genuine user); else (s ∈ R1), the claimed identity is rejected (impostor user).

35 False and miss alarm rates
- The optimal decision corresponds to the regions R1 and R2 that minimize the expected overall risk:
R = E{risk} = λ11 P(impostor|impostor) P(impostor) + λ12 P(impostor|genuine) P(genuine) + λ21 P(genuine|impostor) P(impostor) + λ22 P(genuine|genuine) P(genuine)
where the decisions "impostor" and "genuine" correspond to s ∈ R1 and s ∈ R2, respectively, and:
P(impostor|impostor) = ∫_{R1} p(s|impostor) ds = 1 - P_MA,   P(impostor|genuine) = ∫_{R1} p(s|genuine) ds = P_FA
P(genuine|impostor) = ∫_{R2} p(s|impostor) ds = P_MA,   P(genuine|genuine) = ∫_{R2} p(s|genuine) ds = 1 - P_FA
- Therefore: R = λ11 (1 - P_MA) P_imp + λ12 P_FA P_gen + λ21 P_MA P_imp + λ22 (1 - P_FA) P_gen
- P_FA: false alarm rate (or false reject rate, FRR); P_MA: miss alarm rate (or false acceptance rate, FAR); P_gen = P(genuine); P_imp = P(impostor).
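A small sketch of how these quantities are estimated empirically from verification scores; the simulated genuine/impostor score distributions, the priors and the costs are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated verification scores in [0, 1] (assumed Beta-shaped distributions).
genuine_scores = rng.beta(8, 2, size=5000)    # genuine users tend to score high
impostor_scores = rng.beta(2, 8, size=5000)   # impostors tend to score low

P_gen, P_imp = 0.5, 0.5                       # assumed priors
lam11, lam12, lam21, lam22 = 0.0, 1.0, 10.0, 0.0   # illustrative costs

def rates_and_risk(t):
    """R1 = {s < t} (decide impostor), R2 = {s >= t} (decide genuine)."""
    P_FA = np.mean(genuine_scores < t)        # false alarm / false reject rate
    P_MA = np.mean(impostor_scores >= t)      # miss alarm / false acceptance rate
    R = (lam11 * (1 - P_MA) * P_imp + lam12 * P_FA * P_gen +
         lam21 * P_MA * P_imp + lam22 * (1 - P_FA) * P_gen)
    return P_FA, P_MA, R

for t in (0.3, 0.5, 0.7):
    print("t=%.1f  P_FA=%.3f  P_MA=%.3f  risk=%.3f" % ((t,) + rates_and_risk(t)))
```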

36 Overall risk minimization
- Rewriting the expected overall risk:
R = λ11 ∫_{R1} p(s|impostor) ds P_imp + λ12 ∫_{R1} p(s|genuine) ds P_gen + λ21 ∫_{R2} p(s|impostor) ds P_imp + λ22 ∫_{R2} p(s|genuine) ds P_gen
- Since ∫_{R2} p(s|impostor) ds = 1 - ∫_{R1} p(s|impostor) ds and ∫_{R2} p(s|genuine) ds = 1 - ∫_{R1} p(s|genuine) ds, we obtain:
R = λ21 P_imp + λ22 P_gen + ∫_{R1} [ P_imp (λ11 - λ21) p(s|impostor) + P_gen (λ12 - λ22) p(s|genuine) ] ds
- The integrand should be negative over R1 in order to minimize the risk!

37 Overall risk minimization
- A point s should therefore be assigned to R1 when the integrand is negative:
P_imp (λ11 - λ21) p(s|impostor) + P_gen (λ12 - λ22) p(s|genuine) < 0
P_imp (λ21 - λ11) p(s|impostor) > P_gen (λ12 - λ22) p(s|genuine)
p(s|impostor) / p(s|genuine) > [P_gen (λ12 - λ22)] / [P_imp (λ21 - λ11)]
- This is exactly the minimum risk decision rule applied to each pattern s, which proves that the minimum risk decision rule, used for every pattern s, minimizes the expected overall risk.

38 Minimax decision rule
- In some real cases the priors can change over time, or their estimation can be difficult (intrusion detection, spam filtering, etc.). We need a decision rule that works even if the priors are not known.
- An approach used in many engineering problems is worst-case design: we design the classifier to minimize the risk in the worst case over the possible prior values.
- Minimax: minimize the maximum risk. This is worst-case design; it is very conservative/pessimistic, so performance is not optimal in general, only for the worst case.

39 Minimax decision rule
- Consider a two-class problem and two regions R1 and R2 (not known at the beginning). The overall risk is:
R = ∫_{R1} [λ11 P1 p(x|ω1) + λ12 P2 p(x|ω2)] dx + ∫_{R2} [λ21 P1 p(x|ω1) + λ22 P2 p(x|ω2)] dx
- Using P2 = 1 - P1 and ∫_{R1} p(x|ω1) dx = 1 - ∫_{R2} p(x|ω1) dx, we express R as a function of P1 and simplify the equation (next slide).

40 Minimax decision rule
- Expressing the risk as a function of P1:
R(P1) = λ22 + (λ12 - λ22) ∫_{R1} p(x|ω2) dx + P1 [ (λ11 - λ22) + (λ21 - λ11) ∫_{R2} p(x|ω1) dx - (λ12 - λ22) ∫_{R1} p(x|ω2) dx ]
- Note that the costs and priors identify the threshold θ of the likelihood ratio test:
l(x) = p(x|ω1) / p(x|ω2) ≷ [(λ12 - λ22)/(λ21 - λ11)] · P(ω2)/P(ω1) = θ,   R1 = {x : l(x) > θ}, R2 = {x : l(x) < θ}
The threshold and the density functions define the regions R1 and R2, and hence the overall risk.
- Varying P1 changes the threshold θ(P1), the decision regions, and the overall risk. This does not allow us to control the risk; this is the key point!
- Key issue: P1 can change! We want to estimate the risk we could incur, that is, evaluate it independently of variations of P1.

41 Minimax decision rule
- The above equation shows that, once the decision regions R1 and R2 have been fixed, the overall risk is a linear function of P1. If R1 and R2 make the term in square brackets equal to zero, then the overall risk does not depend on the priors! This is the key point.
- This is the minimax solution, and the minimax risk R_mm is:
R_mm = λ22 + (λ12 - λ22) ∫_{R1} p(x|ω2) dx = λ11 + (λ21 - λ11) ∫_{R2} p(x|ω1) dx
(check this equality as an exercise!)

42 Minimax: risk as a function of P(ω1)
- From the formula of R(P1) it is easy to see that: P1 = 0 implies that region R1 is empty, therefore R = λ22; P1 = 1 implies that region R2 is empty, therefore R = λ11.

43 Minimax: the risk as a linear function
- Assume that P(ω1) = P1* = 0.6, so that the risk associated with the decision regions R1(P1*) and R2(P1*) is R(P1*). If P1 then changes over time while the decision regions stay those identified by P1* = 0.6, the above equation shows that the risk varies as a linear function of P1.

44 Minimax: linear risk function
- To control the risk we choose P1* so that the corresponding linear risk function (the red line in the figure) has slope 0. The related regions R1(P1*) and R2(P1*) yield R_max.
- We are thus minimizing the maximum risk we can incur when the priors change (minimax): any other choice of regions can give a higher risk when the prior changes away from its design value P1*.

45 Minimax
- To identify the regions R1(θ) and R2(θ) associated with the minimax line, we impose:
(λ11 - λ22) + (λ21 - λ11) ∫_{R2} p(x|ω1) dx - (λ12 - λ22) ∫_{R1} p(x|ω2) dx = 0
- We should identify the regions that satisfy the above equation. This can be done in closed form only in a few cases; it is very difficult in most real cases.

46 Minimax straight line
(λ11 - λ22) + (λ21 - λ11) ∫_{R2} p(x|ω1) dx - (λ12 - λ22) ∫_{R1} p(x|ω2) dx = 0
- The above integrals are the error probabilities associated with the two classes (the so-called P_FA and P_MA); it is easy to see that these errors can be controlled by θ. We can therefore look for a threshold θ* that satisfies the above equation; this threshold identifies R1(θ*) = {x : l(x) > θ*} and R2(θ*) = {x : l(x) < θ*}.
- Note that θ* is not given by the usual expression [(λ12 - λ22)/(λ21 - λ11)] · P(ω2)/P(ω1): we are looking for the optimal threshold θ* without relying on the priors.

47 Minimax line with 0-1 costs
- For the 0-1 loss matrix the minimax condition becomes:
∫_{R2} p(x|ω1) dx - ∫_{R1} p(x|ω2) dx = 0, i.e., ∫_{R2} p(x|ω1) dx = ∫_{R1} p(x|ω2) dx
- The threshold θ* is the one that makes the two error probabilities equal (P_FA = P_MA). In fingerprint recognition it is called the EER (Equal Error Rate) threshold (see the figure with the impostor and genuine score distributions).
- Note that the threshold θ* that makes P_FA = P_MA is not the optimal Bayesian threshold: we are using the minimax rule, which is not the ideal minimum risk rule.

48 Minimax line: general case
(λ11 - λ22) + (λ21 - λ11) ∫_{R2} p(x|ω1) dx - (λ12 - λ22) ∫_{R1} p(x|ω2) dx = 0
- In general, the empirical computation of the minimax line requires a classifier that allows control of the regions R1 and R2. This is not always doable.
- If the loss matrix has 0-1 values, we can change the parameters of the classifier so that ∫_{R2} p(x|ω1) dx = ∫_{R1} p(x|ω2) dx, that is, we tune the parameters to make the two error probabilities equal (see the sketch below).
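A minimal numerical sketch of this tuning for a single-feature problem with 0-1 losses: scan the decision threshold until the two class-conditional error probabilities match. The Gaussian densities are the same illustrative assumptions used in the earlier examples.

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities (assumptions); class 1 lies to the left.
p1, p2 = norm(4.0, 1.0), norm(7.0, 1.5)

# With a single decision point t: R1 = {x < t}, R2 = {x >= t}.
# Error on class 1: P{x in R2 | w1} = 1 - F1(t); error on class 2: P{x in R1 | w2} = F2(t).
def error_gap(t):
    return (1 - p1.cdf(t)) - p2.cdf(t)

# Scan thresholds and pick the one where the two errors are (nearly) equal.
ts = np.linspace(0, 12, 12001)
t_star = ts[np.argmin(np.abs([error_gap(t) for t in ts]))]

print("minimax threshold t* = %.3f" % t_star)
print("P{error|w1} = %.4f   P{error|w2} = %.4f" % (1 - p1.cdf(t_star), p2.cdf(t_star)))
```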

49 Neyman-Pearson decision rule
- If we know neither the priors nor the costs, we can use the Neyman-Pearson decision rule. This rule is used in applications where we have a constraint on the false alarm rate and we want to minimize the miss alarm rate (e.g., radar applications or biometric recognition).
- We fix a given false alarm rate, P_FA = α. The Neyman-Pearson rule minimizes P_MA subject to P_FA = α. Using a Lagrange multiplier λ, the minimization problem is formulated as:
F = P_MA + λ (P_FA - α) = ∫_{R2} p(x|ω1) dx + λ [ ∫_{R1} p(x|ω2) dx - α ]
  = ∫_{R2} p(x|ω1) dx + λ [ 1 - ∫_{R2} p(x|ω2) dx - α ]
  = λ (1 - α) + ∫_{R2} [ p(x|ω1) - λ p(x|ω2) ] dx

50 Neyman-Pearson rule for a two-class problem
- Problem: identify the region R2 that solves the constrained minimization above. Disregarding the constant term λ(1 - α), the problem becomes:
min_{R2} ∫_{R2} [ p(x|ω1) - λ p(x|ω2) ] dx   subject to   P_FA = ∫_{R1} p(x|ω2) dx = α
- The integral is minimized when R2 contains exactly the points where the integrand is negative, so R2 = {x ∈ R : p(x|ω1) < λ p(x|ω2)} = {x ∈ R : l(x) < λ}.
- Therefore the decision rule is: l(x) = p(x|ω1)/p(x|ω2) ≷ λ, deciding ω1 if l(x) > λ and ω2 if l(x) < λ.

51 Neyman-Pearson rule: threshold computation
- The Neyman-Pearson rule is based on a likelihood ratio test, whose threshold comes from the constraint. Since P_FA = P_FA(λ), the constraint P_FA = α identifies the value of λ. More precisely, treating l = l(x) as a random variable (a function of x), we have:
P_FA = P{l(x) > λ | ω2} = ∫_{λ}^{+∞} p_l(l|ω2) dl,  and λ = λ* is chosen so that ∫_{λ*}^{+∞} p_l(l|ω2) dl = α
- In practice, the constraint P_FA ≤ α is met by experiments.
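A sketch of how λ* can be found empirically when samples of the ω2 (noise/impostor) class are available; the simulated data and the target α below are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Illustrative densities: omega_1 = "signal"/target class, omega_2 = "noise" class.
p1, p2 = norm(4.0, 1.0), norm(7.0, 1.5)

# Likelihood ratio evaluated on samples drawn from omega_2 only.
x2 = rng.normal(7.0, 1.5, 20000)
l_omega2 = p1.pdf(x2) / p2.pdf(x2)

alpha = 0.05                                   # required false alarm rate P_FA
# lambda* = (1 - alpha)-quantile of l(x) under omega_2, so that P{l > lambda* | omega_2} ~= alpha.
lam_star = np.quantile(l_omega2, 1 - alpha)

P_FA_hat = np.mean(l_omega2 > lam_star)
print("lambda* = %.4g   empirical P_FA = %.4f (target alpha = %.2f)" % (lam_star, P_FA_hat, alpha))
```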

52 Neyman-Pearson: biometric example
- The figure shows the genuine and impostor score distributions. In this case it is easy to find, by experiments, the threshold value that provides P_FA ≤ α.

53 Neyman-Pearson rule and the ROC curve
- With the Neyman-Pearson rule, varying the threshold λ yields different values of P_FA and of the probability of detection P_D (P_D = 1 - P_MA). To assess performance over varying threshold values one can use the ROC (Receiver Operating Characteristic) curve, which plots P_D as a function of P_FA for the different thresholds.
- The ROC curve allows one to analyse the trade-off between the false and miss alarm rates; a good ROC curve has small P_FA and large P_D; the ideal ROC curve is a rectangular (step-like) curve.
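A compact sketch of how an empirical ROC curve is built from scores of the two classes; the simulated scores are, as before, an assumption made only for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated detection scores (illustrative): omega_1 = target class, omega_2 = non-target.
scores_w1 = rng.normal(2.0, 1.0, 5000)   # scores under omega_1
scores_w2 = rng.normal(0.0, 1.0, 5000)   # scores under omega_2

# Sweep the threshold: decide omega_1 when score > t.
thresholds = np.linspace(-4, 6, 201)
P_FA = np.array([np.mean(scores_w2 > t) for t in thresholds])   # false alarm rate
P_D  = np.array([np.mean(scores_w1 > t) for t in thresholds])   # detection rate = 1 - P_MA

# Each (P_FA[i], P_D[i]) pair is one point of the ROC curve.
for i in range(0, len(thresholds), 50):
    print("t=%5.2f  P_FA=%.3f  P_D=%.3f" % (thresholds[i], P_FA[i], P_D[i]))
```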

54 Decision with reject option
- Even if one were able to achieve the minimum Bayes error (threshold xB in the figure), this error rate could still be unacceptable for a given application. Example: screening for medical diagnosis, where one could demand a false negative rate equal to zero.
- This happens when errors are very costly, so we must protect against them by limiting the error rate below a given threshold; an obvious way to limit decision errors is not to make a decision, or to postpone it.
- To reduce the error probability one can therefore omit or defer decisions (reject option). Omitting decisions is a rational and feasible option provided that the decision can be taken in some other way (e.g., by a human).

55 Classification with reject option
- A classifier with reject option takes x ∈ R^d as input and outputs either a class (x → ωi) or a rejection.
- The reject option requires an additional decision with respect to the standard formulation of the classification problem: set of classes Ω = {ω1, ω2, ..., ωc}; set of actions/decisions A = {α0, α1, α2, ..., αa}. If the action is a classification: A = {ω0, ω1, ω2, ..., ωc}.
- We have an additional class ω0: the class containing the rejected samples.

56 Loss matrix and minimum risk with reject option
- The loss matrix Λ, of size (c+1) × c, is:
Λ = [ λ(ω0|ω1) λ(ω0|ω2) ... λ(ω0|ωc)
      λ(ω1|ω1) λ(ω1|ω2) ... λ(ω1|ωc)
      ...
      λ(ωc|ω1) λ(ωc|ω2) ... λ(ωc|ωc) ]
- The minimum risk decision criterion is: x → ωi if R(ωi|x) < R(ωj|x) for all j ≠ i, i = 0, 1, ..., c.
- The main difference with respect to the case without reject option is that the minimum risk decision can be a rejection, when R(ω0|x) < R(ωj|x) for all j ≠ 0.

57 Binary classification with equal costs
- Consider binary classification with equal costs:
Λ = [ λ(ω0|ω1) λ(ω0|ω2)     [ λr  λr
      λ(ω1|ω1) λ(ω1|ω2)  =    λc  λe
      λ(ω2|ω1) λ(ω2|ω2) ]     λe  λc ]
Reject cost = λr, error cost = λe, cost of correct classification = λc (usually λc = 0).
- According to the minimum risk criterion we have three decision regions:
R0 = {x ∈ R : R(ω0|x) < R(ωj|x), j ≠ 0}
R1 = {x ∈ R : R(ω1|x) < R(ωj|x), j ≠ 1}
R2 = {x ∈ R : R(ω2|x) < R(ωj|x), j ≠ 2}

58 Binary classification with equal costs
- The overall risk R can be written as:
R = λr [P(reject, ω1) + P(reject, ω2)] + λe [P(error, ω1) + P(error, ω2)] + λc [P(correct, ω1) + P(correct, ω2)]
  = λr [ P(ω1) ∫_{R0} p(x|ω1) dx + P(ω2) ∫_{R0} p(x|ω2) dx ]
  + λe [ P(ω1) ∫_{R2} p(x|ω1) dx + P(ω2) ∫_{R1} p(x|ω2) dx ]
  + λc [ P(ω1) ∫_{R1} p(x|ω1) dx + P(ω2) ∫_{R2} p(x|ω2) dx ]

59 The error-reject trade-off
- From the previous equation we obtain: R = λr P(reject) + λe P(error) + λc P(correct). Since P(reject) + P(error) + P(correct) = 1, we have:
R = λc + (λr - λc) P(reject) + (λe - λc) P(error)
- This formulation of the overall risk clearly shows that a given value of the risk can be obtained by trading off error probability against reject probability: the error-reject trade-off.
- The trade-off is also shown by the relation between the three probabilities: P(correct) = 1 - P(reject) - P(error).
- We can reduce the error probability by increasing the reject probability, and vice versa.

60 Error-reject trade-off and Chow's rule
- Requirements of practical applications are often given as: minimize the error probability with P(reject) lower than r (e.g., minimize the error with reject < 15%). These requirements can be satisfied by Chow's rule (C.K. Chow, 1957, 1970), which is the optimal rule with reject option:
if max_i P(ωi|x) ≥ T, assign x to the class ωi with maximum posterior; otherwise reject x, with T = (λe - λr) / (λe - λc)
- T is the reject threshold; T ∈ [0, 1] because λc ≤ λr ≤ λe. For T = 0 (λe = λr) we obtain the classical MAP rule.
- It can be shown (C.K. Chow, 1957) that Chow's rule minimizes the error probability (i.e., maximizes classification accuracy) for any value of the reject probability. Chow's rule reduces the error by rejecting the patterns for which the classification is not reliable enough (see the sketch below).
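A minimal sketch of Chow's rule on top of posterior estimates; the cost values and the example posteriors below are illustrative assumptions.

```python
import numpy as np

lam_c, lam_r, lam_e = 0.0, 0.2, 1.0          # assumed costs: correct, reject, error
T = (lam_e - lam_r) / (lam_e - lam_c)        # Chow's reject threshold (here T = 0.8)

def chow_decision(posteriors):
    """Return the predicted class index, or None to reject (Chow's rule)."""
    posteriors = np.asarray(posteriors)
    i = int(np.argmax(posteriors))
    return i if posteriors[i] >= T else None

# Example posteriors for three patterns (assumed values).
for post in ([0.95, 0.03, 0.02], [0.55, 0.40, 0.05], [0.40, 0.35, 0.25]):
    print(post, "->", chow_decision(post))
```

Lowering λr lowers T and makes the classifier reject less often, which is exactly the error-reject trade-off discussed above.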

61 Example of use of Chow's rule
- The figure shows P(ω1|x) and P(ω2|x) for two classes with Gaussian distributions, the reject threshold T, and the resulting regions R1, R0, R2: the reject threshold T identifies the reject region R0.
- The example clearly shows that the error probability can be reduced by increasing the reject threshold T; the error becomes zero when the region R0 contains all the patterns that would be misclassified.

62 Examples of error-reject trade-off
- Hypothetical error-reject trade-off curve: as T increases, the rejection rate increases and the error decreases (error-reject trade-off).
- The figure also shows examples of accuracy-rejection curves for different OCR (Optical Character Recognition) algorithms.

63 Discriminant functions, decision surfaces/regions
- So far we have represented a classifier as a machine that takes the pattern x as input and assigns it to a class according to minimum risk theory: x → ωi if R(ωi|x) < R(ωj|x) for all j ≠ i, i = 1, ..., c.
- An alternative representation of a pattern classifier is in terms of a set of discriminant functions gi(x), i = 1, ..., c: we assign x to the class ωi if gi(x) > gj(x) for all j ≠ i.

64 Discriminant functions, decision surfaces/regions
- In general we can take gi(x) = -R(ωi|x): the discriminant function aims to minimize the risk. If we want to minimize the error probability: gi(x) = P(ωi|x).
- There are many possible choices for gi(x): we can replace gi(x) with f(gi(x)), where f(·) is a monotonically increasing function. In particular, if we want to minimize the error probability, all the following choices are equivalent:
gi(x) = P(ωi|x) = p(x|ωi) P(ωi) / Σ_{j=1}^{c} p(x|ωj) P(ωj)
gi(x) = p(x|ωi) P(ωi)
gi(x) = ln p(x|ωi) + ln P(ωi)
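A small sketch of the last form, gi(x) = ln p(x|ωi) + ln P(ωi), for a three-class problem with one-dimensional Gaussian class-conditionals; all parameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Illustrative 1-D Gaussian class-conditionals and priors for three classes.
classes = [norm(0.0, 1.0), norm(3.0, 1.0), norm(6.0, 2.0)]
priors = np.array([0.5, 0.3, 0.2])

def discriminants(x):
    """g_i(x) = ln p(x|omega_i) + ln P(omega_i): a monotone transform of the posterior."""
    return np.array([c.logpdf(x) + np.log(p) for c, p in zip(classes, priors)])

for x in (0.5, 2.8, 7.0):
    g = discriminants(x)
    print("x=%.1f  g=%s  -> omega_%d" % (x, np.round(g, 2), np.argmax(g) + 1))
```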

65 Discriminant functions, decision surfaces/regions
- The discriminant functions subdivide the feature space into c decision regions R1, ..., Rc. If gi(x) > gj(x) for every j ≠ i, then x ∈ Ri and it is assigned to the class ωi.
- The decision boundaries between regions are specified by gi(x) = gj(x) for the two discriminant functions with the largest values.
- Two-dimensional example in the figure: two classes with Gaussian distributions, quadratic decision surfaces; region R2 is not simply connected.

66 Discriminant functions for binary classification
- For a two-class problem we can use a single discriminant function g(x) = g1(x) - g2(x): if g(x) > 0, decide ω1, otherwise ω2.
- The following forms of the discriminant function can be used for a two-class problem:
g(x) = P(ω1|x) - P(ω2|x)
g(x) = ln [p(x|ω1) / p(x|ω2)] + ln [P(ω1) / P(ω2)]
- A pattern classifier based on one of the above discriminant functions for a two-class task is called a dichotomizer.
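A sketch of the log-ratio dichotomizer, reusing the illustrative two-class Gaussian setting assumed in the earlier examples:

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities and priors (assumptions, as before).
p1, p2 = norm(4.0, 1.0), norm(7.0, 1.5)
P1, P2 = 2/3, 1/3

def g(x):
    """Dichotomizer g(x) = ln[p(x|w1)/p(x|w2)] + ln[P(w1)/P(w2)]; decide w1 iff g(x) > 0."""
    return (p1.logpdf(x) - p2.logpdf(x)) + np.log(P1 / P2)

for x in (4.5, 6.0, 8.0):
    print("x=%.1f  g(x)=%+.3f  ->" % (x, g(x)), "omega_1" if g(x) > 0 else "omega_2")
```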

67 References
- Sections 2.1-2.6, Pattern Classification, R. O. Duda, P. E. Hart, and D. G. Stork, John Wiley & Sons, 2000.
- Chapter 1, Statistical Pattern Recognition, Andrew Webb, John Wiley & Sons, 2002.
- C. K. Chow, "On optimum recognition error and reject tradeoff", IEEE Transactions on Information Theory, vol. 16, no. 1, pp. 41-46, 1970.


More information

Pattern Recognition Problem. Pattern Recognition Problems. Pattern Recognition Problems. Pattern Recognition: OCR. Pattern Recognition Books

Pattern Recognition Problem. Pattern Recognition Problems. Pattern Recognition Problems. Pattern Recognition: OCR. Pattern Recognition Books Introduction to Statistical Pattern Recognition Pattern Recognition Problem R.P.W. Duin Pattern Recognition Group Delft University of Technology The Netherlands prcourse@prtools.org What is this? What

More information

Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak

Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak 1 Introduction. Random variables During the course we are interested in reasoning about considered phenomenon. In other words,

More information

Advanced statistical methods for data analysis Lecture 1

Advanced statistical methods for data analysis Lecture 1 Advanced statistical methods for data analysis Lecture 1 RHUL Physics www.pp.rhul.ac.uk/~cowan Universität Mainz Klausurtagung des GK Eichtheorien exp. Tests... Bullay/Mosel 15 17 September, 2008 1 Outline

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

DETECTION theory deals primarily with techniques for

DETECTION theory deals primarily with techniques for ADVANCED SIGNAL PROCESSING SE Optimum Detection of Deterministic and Random Signals Stefan Tertinek Graz University of Technology turtle@sbox.tugraz.at Abstract This paper introduces various methods for

More information

Introduction to Machine Learning

Introduction to Machine Learning Outline Introduction to Machine Learning Bayesian Classification Varun Chandola March 8, 017 1. {circular,large,light,smooth,thick}, malignant. {circular,large,light,irregular,thick}, malignant 3. {oval,large,dark,smooth,thin},

More information

Machine Learning Lecture 2

Machine Learning Lecture 2 Machine Perceptual Learning and Sensory Summer Augmented 6 Computing Announcements Machine Learning Lecture 2 Course webpage http://www.vision.rwth-aachen.de/teaching/ Slides will be made available on

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and Estimation I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 22, 2015

More information

INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS. Michael A. Lexa and Don H. Johnson

INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS. Michael A. Lexa and Don H. Johnson INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS Michael A. Lexa and Don H. Johnson Rice University Department of Electrical and Computer Engineering Houston, TX 775-892 amlexa@rice.edu,

More information

outline Nonlinear transformation Error measures Noisy targets Preambles to the theory

outline Nonlinear transformation Error measures Noisy targets Preambles to the theory Error and Noise outline Nonlinear transformation Error measures Noisy targets Preambles to the theory Linear is limited Data Hypothesis Linear in what? Linear regression implements Linear classification

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

Mixtures of Gaussians with Sparse Structure

Mixtures of Gaussians with Sparse Structure Mixtures of Gaussians with Sparse Structure Costas Boulis 1 Abstract When fitting a mixture of Gaussians to training data there are usually two choices for the type of Gaussians used. Either diagonal or

More information

Bayes Rule for Minimizing Risk

Bayes Rule for Minimizing Risk Bayes Rule for Minimizing Risk Dennis Lee April 1, 014 Introduction In class we discussed Bayes rule for minimizing the probability of error. Our goal is to generalize this rule to minimize risk instead

More information

Introduction to Detection Theory

Introduction to Detection Theory Introduction to Detection Theory Detection Theory (a.k.a. decision theory or hypothesis testing) is concerned with situations where we need to make a decision on whether an event (out of M possible events)

More information

Linear Classifiers as Pattern Detectors

Linear Classifiers as Pattern Detectors Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2014/2015 Lesson 16 8 April 2015 Contents Linear Classifiers as Pattern Detectors Notation...2 Linear

More information

Optimum Joint Detection and Estimation

Optimum Joint Detection and Estimation 20 IEEE International Symposium on Information Theory Proceedings Optimum Joint Detection and Estimation George V. Moustakides Department of Electrical and Computer Engineering University of Patras, 26500

More information

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)?

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)? ECE 830 / CS 76 Spring 06 Instructors: R. Willett & R. Nowak Lecture 3: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics Executive summary In the last lecture we

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Object Recognition. Digital Image Processing. Object Recognition. Introduction. Patterns and pattern classes. Pattern vectors (cont.

Object Recognition. Digital Image Processing. Object Recognition. Introduction. Patterns and pattern classes. Pattern vectors (cont. 3/5/03 Digital Image Processing Obect Recognition Obect Recognition One of the most interesting aspects of the world is that it can be considered to be made up of patterns. Christophoros Nikou cnikou@cs.uoi.gr

More information

Classifier performance evaluation

Classifier performance evaluation Classifier performance evaluation Václav Hlaváč Czech Technical University in Prague Czech Institute of Informatics, Robotics and Cybernetics 166 36 Prague 6, Jugoslávských partyzánu 1580/3, Czech Republic

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,

More information

PATTERN CLASSIFICATION

PATTERN CLASSIFICATION PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

More information

The Naïve Bayes Classifier. Machine Learning Fall 2017

The Naïve Bayes Classifier. Machine Learning Fall 2017 The Naïve Bayes Classifier Machine Learning Fall 2017 1 Today s lecture The naïve Bayes Classifier Learning the naïve Bayes Classifier Practical concerns 2 Today s lecture The naïve Bayes Classifier Learning

More information

Statistical and Learning Techniques in Computer Vision Lecture 1: Random Variables Jens Rittscher and Chuck Stewart

Statistical and Learning Techniques in Computer Vision Lecture 1: Random Variables Jens Rittscher and Chuck Stewart Statistical and Learning Techniques in Computer Vision Lecture 1: Random Variables Jens Rittscher and Chuck Stewart 1 Motivation Imaging is a stochastic process: If we take all the different sources of

More information

Necessary Corrections in Intransitive Likelihood-Ratio Classifiers

Necessary Corrections in Intransitive Likelihood-Ratio Classifiers Necessary Corrections in Intransitive Likelihood-Ratio Classifiers Gang Ji and Jeff Bilmes SSLI-Lab, Department of Electrical Engineering University of Washington Seattle, WA 9895-500 {gang,bilmes}@ee.washington.edu

More information

Linear Discrimination Functions

Linear Discrimination Functions Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach

More information

Pattern Recognition. Parameter Estimation of Probability Density Functions

Pattern Recognition. Parameter Estimation of Probability Density Functions Pattern Recognition Parameter Estimation of Probability Density Functions Classification Problem (Review) The classification problem is to assign an arbitrary feature vector x F to one of c classes. The

More information