Part 2 Elements of Bayesian Decision Theory

1 Part 2: Elements of Bayesian Decision Theory (Machine Learning, March 2017, Fabio Roli)

2 Introduction
- Statistical pattern classification is grounded in Bayesian decision theory; knowing the elements of this theory is therefore a must for anybody who wants to work in this field.
- Bayesian theory assumes that decision-making problems are formulated in probabilistic terms. The probabilities are either known or must be estimated.
- Bayes decision theory takes into account both the probability and the risk of decisions. Making a rational decision means weighing both the probability and the risk (or the utility) associated with the decision.
- We start our presentation of the elements of Bayesian decision theory by assuming that all the probabilities involved in the problem are known.

3 First thing to know: the MAP decision rule
- We present the fundamental decision rule of Bayesian decision theory using the salmon/sea bass classification example introduced in Part 1. Assume that an image segmentation module has already extracted the shape of each fish, as shown in the figure, and that a feature extraction module has characterized each shape/pattern with one feature: the average lightness of the shape.
- Decision problem: we want to assign each shape/pattern to one of the two classes considered (salmon, sea bass).

4 First thing to know: the MAP decision rule
- We cannot know deterministically the class (salmon or sea bass) of the next fish arriving on the conveyor belt, so the problem must be formulated in probabilistic terms: the next fish is a salmon or a sea bass with a given probability.
- Bayes decision theory formalizes this situation with the concept of state of nature (usually called class in pattern recognition). In our example we have two states of nature/classes, ω1 and ω2. Let ω be the random variable that identifies the class, taking the value ω1 or ω2.
- The two classes could have the same prior probability, P(ω1) = P(ω2), with P(ω1) + P(ω2) = 1 (we have just two species of fish).

5 The MAP decision rule
- If we had to make a decision without seeing the incoming fish, the only rational decision would be: assign the fish to ω1 if P(ω1) > P(ω2), otherwise assign it to ω2. This blind (a priori) decision works well only if one class is much more likely, e.g., P(ω1) >> P(ω2) (and, as we will see later, the two decisions have the same risk).
- In general, to make a rational decision according to Bayesian theory we must observe the pattern: we must see the fish and characterize it with some features, for example the average lightness of the pattern. Since the fish arriving on the belt have random lightness values, the lightness feature x must be treated as a random variable with class-conditional distribution p(x|ωi).

6 An example of a one-dimensional p(x|ωi)
- p(x|ωi) is the class-conditional probability density function. If x is the average lightness of the image region associated with a fish of class ωi, then the difference between the functions p(x|ω1) and p(x|ω2) characterizes the expected lightness difference between the two fish species.

7 Bayes decision rule
- Assume we know the two priors P(ωj) and the two class-conditional density functions p(x|ωj), j = 1, 2. If we measure the average lightness x of the incoming fish, then, taking into account the probabilistic nature of the problem, the most rational decision rule is based on the joint probability:
P(ωj, x) = P(ωj|x) p(x) = p(x|ωj) P(ωj)
which we can rewrite as Bayes' rule, the basis of the MAP (maximum a posteriori) decision rule:
P(ωj|x) = p(x|ωj) P(ωj) / p(x)
Posterior = (Likelihood × Prior) / Evidence
- Note that the evidence is p(x) = Σ_{j=1}^{2} p(x|ωj) P(ωj).
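As a concrete illustration, the posterior computation can be sketched in a few lines of Python; the Gaussian class-conditional densities and the priors used below are assumptions chosen only for this example, not values from the slides.

```python
import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities p(x|omega_j): lightness modeled as Gaussian
# for each fish species (illustrative parameters only).
likelihoods = [norm(loc=4.0, scale=1.0),   # p(x|omega_1), e.g. salmon
               norm(loc=7.0, scale=1.5)]   # p(x|omega_2), e.g. sea bass
priors = np.array([2/3, 1/3])              # P(omega_1), P(omega_2)

def posteriors(x):
    """Return [P(omega_1|x), P(omega_2|x)] via Bayes' rule."""
    joint = np.array([lik.pdf(x) * p for lik, p in zip(likelihoods, priors)])
    evidence = joint.sum()                  # p(x) = sum_j p(x|omega_j) P(omega_j)
    return joint / evidence

x = 5.0                                     # measured average lightness
post = posteriors(x)
print(post, "-> decide omega_%d" % (np.argmax(post) + 1))
```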

8 An example of one-dimensional P(ωi|x)
- P(ωi|x) computed with P(ω1) = 2/3 and P(ω2) = 1/3.

9 The MAP decision rule
- The MAP (maximum a posteriori probability) criterion is the most rational decision rule for the probabilistic setting considered:
if P(ω1|x) > P(ω2|x), assign x to ω1; if P(ω1|x) < P(ω2|x), assign x to ω2.
- This rule is the most rational one because it minimizes the error probability for any given x:
P(error|x) = P(ω1|x) if we assign x to ω2; P(error|x) = P(ω2|x) if we assign x to ω1.
- We can prove that the MAP rule also minimizes the average error:
P(error) = ∫_{-∞}^{+∞} P(error, x) dx = ∫_{-∞}^{+∞} P(error|x) p(x) dx

10 Likelihood ratio test and ML rule
- We can reformulate the MAP rule as follows: if p(x|ω1) P(ω1) > p(x|ω2) P(ω2), assign x to ω1, else assign x to ω2. Note: the evidence p(x) does not matter!
- Likelihood ratio test: we compare the likelihood ratio l(x) = p(x|ω1) / p(x|ω2) against a threshold θ = P(ω2) / P(ω1) that does not depend on x:
l(x) > θ ⇒ decide ω1,  l(x) < θ ⇒ decide ω2.
Note: the likelihood ratio is compared with the ratio between the priors.
- Two special cases: if p(x|ω1) = p(x|ω2), the decision depends only on the priors; if P(ω1) = P(ω2), the decision depends only on the likelihoods (ML, maximum likelihood, decision rule).
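A minimal sketch of the likelihood ratio test, reusing the illustrative Gaussian densities assumed in the previous example:

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities and priors (assumptions, as before).
p1, p2 = norm(4.0, 1.0), norm(7.0, 1.5)
P1, P2 = 2/3, 1/3

def decide(x):
    """MAP decision via the likelihood ratio test l(x) = p(x|w1)/p(x|w2) vs theta = P2/P1."""
    l = p1.pdf(x) / p2.pdf(x)
    theta = P2 / P1
    return "omega_1" if l > theta else "omega_2"

# With equal priors (theta = 1) the same test reduces to the ML rule.
print(decide(5.0), decide(6.5))
```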

11 The concept of decision regions
- The likelihood ratio test is defined by l(x) and the threshold θ. The test identifies two decision regions R1 and R2 in the feature space R (here we consider a single feature x):
R1 = {x ∈ R : l(x) > θ} and R2 = {x ∈ R : l(x) < θ} (if l(x) = θ, x can be assigned randomly to R1 or R2).
- Given the probability density functions p(x|ω1) and p(x|ω2), the regions R1 = R1(θ) and R2 = R2(θ) are determined by the threshold θ; in the Gaussian example of the figure, the boundary is the point x = x(θ).

12 MAP decision rule with more than two classes
- The MAP decision rule with more than two classes is:
x → ωi if P(ωi|x) > P(ωj|x) for all j ≠ i, i = 1, ..., c
- The likelihood ratio test is defined accordingly. It is easy to see that multiple thresholds θ_st arise, one for each pair of classes (s, t) whose posteriors P(ωs|x) and P(ωt|x) are the two largest in some region of the feature space.
- In the example of the figure (the curves P(ωi|x)), we have three classes and two thresholds, θ_12 and θ_23.

13 Basic concepts of error probability
- For the two-class case:
P(error) = P{x ∈ R2, ω1} + P{x ∈ R1, ω2}
         = P(ω1) P{x ∈ R2 | ω1} + P(ω2) P{x ∈ R1 | ω2}
         = P(ω1) ∫_{R2} p(x|ω1) dx + P(ω2) ∫_{R1} p(x|ω2) dx
- The optimal threshold x = xB is the Bayesian threshold providing the minimum error, called the Bayes error. In the figure, x = x* is a suboptimal threshold that introduces an added (reducible) error over the Bayes error. In practical cases we usually have an added error, because the optimal threshold providing the Bayes error is nearly impossible to estimate.
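The two integrals above are easy to evaluate numerically for one-dimensional densities. The sketch below, which again assumes illustrative Gaussian class-conditionals, compares the Bayes error at the optimal threshold with the added error of a suboptimal one.

```python
import numpy as np
from scipy.stats import norm

# Illustrative densities and priors (assumptions); class 1 lies to the left.
p1, p2 = norm(4.0, 1.0), norm(7.0, 1.5)
P1, P2 = 2/3, 1/3

x = np.linspace(-5, 20, 20001)
dx = x[1] - x[0]
post_unnorm_1 = P1 * p1.pdf(x)
post_unnorm_2 = P2 * p2.pdf(x)

def error_for_threshold(t):
    """P(error) when R1 = {x < t} (decide omega_1) and R2 = {x >= t} (decide omega_2)."""
    r2 = x >= t
    r1 = ~r2
    return post_unnorm_1[r2].sum() * dx + post_unnorm_2[r1].sum() * dx

# Bayes threshold: the point where P1*p(x|w1) = P2*p(x|w2) between the two means.
mask = (x > 4.0) & (x < 7.0)
x_B = x[mask][np.argmin(np.abs(post_unnorm_1[mask] - post_unnorm_2[mask]))]

print("Bayes error      :", error_for_threshold(x_B))
print("error at x*=6.5  :", error_for_threshold(6.5), "(added, reducible error)")
```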

14 Basic concepts of error probability
- With more than two classes (c > 2), it is more convenient to compute the error probability through the probability of correct classification:
P(correct) = Σ_{i=1}^{c} P{x ∈ Ri, ωi} = Σ_{i=1}^{c} P(ωi) P{x ∈ Ri | ωi} = Σ_{i=1}^{c} P(ωi) ∫_{Ri} p(x|ωi) dx
P(error) = 1 - P(correct)
- In general this computation can be very difficult, as it requires multidimensional integrals over density functions with complicated analytical forms. The computation is easy only for Gaussian densities (try it and see!).

15 MAP rule for error probability minimization
- Let us show that the MAP rule, x → ωi if P(ωi|x) > P(ωj|x) for all j ≠ i (i = 1, ..., c), minimizes the error probability.
- The error probability can be written as:
P(error) = Σ_{i=1}^{c} P(error|ωi) P(ωi), with P(error|ωi) = ∫_{C[Ri]} p(x|ωi) dx
where C[Ri] = ∪_{j=1, j≠i}^{c} Rj is the union of all decision regions other than Ri.

16 MAP rule for error probability minimization
- Therefore, the error probability can be rewritten as:
P(error) = Σ_{i=1}^{c} ∫_{C[Ri]} p(x|ωi) P(ωi) dx = Σ_{i=1}^{c} P(ωi) [1 - ∫_{Ri} p(x|ωi) dx] = 1 - Σ_{i=1}^{c} P(ωi) ∫_{Ri} p(x|ωi) dx
- Minimizing the error probability is therefore equivalent to maximizing the term related to the probability of correct classification, Σ_{i=1}^{c} P(ωi) ∫_{Ri} p(x|ωi) dx.
- This implies that each decision region Ri should contain the points x for which p(x|ωi) P(ωi), and hence P(ωi|x), is maximum; this proves that the MAP decision rule minimizes the error probability.

17 A quick note on error-probability upper bounds
- As the exact computation of the error probability is often nearly impossible, some upper bounds on the error probability have been proposed: the Chernoff bound and the Bhattacharyya bound. Students interested in further details are referred to Section 2.8 of Pattern Classification, by R. O. Duda, P. E. Hart, and D. G. Stork, John Wiley & Sons, 2000.
- However, these bounds were designed for Gaussian density functions and are not reliable for non-Gaussian ones. They are often loose bounds, useful in practice only if the error value provided by the bound is itself acceptable (knowing that the error is below k% must be enough for your practical purposes!).
- We will see later experimental techniques to assess the error probability of a pattern classifier (Part 7).
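As an illustration of how such a bound is used in the Gaussian case, the sketch below computes the Bhattacharyya bound P(error) ≤ sqrt(P1 P2) exp(-k(1/2)) for two multivariate Gaussian classes, following the formula given in Duda, Hart and Stork, Sec. 2.8; the means, covariances and priors are illustrative assumptions.

```python
import numpy as np

# Illustrative parameters of two Gaussian classes (assumptions).
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S1 = np.array([[1.0, 0.2], [0.2, 1.0]])
S2 = np.array([[1.5, 0.0], [0.0, 0.5]])
P1, P2 = 0.5, 0.5

def bhattacharyya_bound(mu1, S1, mu2, S2, P1, P2):
    """Bhattacharyya bound on P(error) for two Gaussian classes."""
    Sm = 0.5 * (S1 + S2)
    dmu = mu2 - mu1
    # k(1/2) = 1/8 (mu2-mu1)' [(S1+S2)/2]^-1 (mu2-mu1) + 1/2 ln(|(S1+S2)/2| / sqrt(|S1||S2|))
    k = (0.125 * dmu @ np.linalg.solve(Sm, dmu)
         + 0.5 * np.log(np.linalg.det(Sm) / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))
    return np.sqrt(P1 * P2) * np.exp(-k)

print("P(error) <=", bhattacharyya_bound(mu1, S1, mu2, S2, P1, P2))
```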

18 Bayesian decision theory
Now we generalize the standard setting by:
1. Allowing the use of more than one feature (feature space): x = (x1, x2, ..., xd), a feature vector with d elements.
2. Allowing more than two classes.
3. Introducing the concept of risk as a generalization of the concept of error probability.
4. Allowing the reject option, that is, making no decision when deciding is too risky/costly and the decision can be postponed, possibly asking for human decision/intervention.

19 From error to risk
- The following expression of the conditional error probability assumes that all errors are equal, i.e., all the error terms (j ≠ i) have the same cost, equal to 1:
P(error | x, decide ωi) = Σ_{j=1, j≠i}^{c} P(ωj|x) = 1 - P(ωi|x)
- For some applications these costs need to be different. If the costs differ, the above quantity is no longer an error probability; the notion of risk function is then defined:
R(ωi|x) = Σ_{j=1}^{c} w_ij P(ωj|x)
- The weights w_ij are the costs of the errors. Later we will denote them as λ(αi|ωj). Note that w_ii can be negative (a gain).

20 Minimum risk theory
- The MAP rule does not consider the different costs associated with the different errors. In some applications this is not acceptable, because errors can cause different losses and should therefore have different costs.
- Minimum risk theory (also called utility theory in economics) takes into account both the probabilities and the costs of actions/decisions.
- Problem formulation: classes Ω = {ω1, ω2, ..., ωc}; actions/decisions A = {α1, α2, ..., αa}.
- In most cases we consider action = classification, that is, the action is the decision about the class of the pattern.

21 Minimum risk theory
- The costs (losses) associated with the different actions, given the possible true classes, are defined by the loss matrix Λ:
Λ = [ λ(α1|ω1) λ(α1|ω2) ... λ(α1|ωc)
      λ(α2|ω1) λ(α2|ω2) ... λ(α2|ωc)
      ...
      λ(αa|ω1) λ(αa|ω2) ... λ(αa|ωc) ]
- λ(αi|ωj) is the loss function denoting the loss/cost of taking action/decision αi when the true class is ωj.
- An example of loss matrix for intrusion detection in computer networks: Ω = {ω1 = malicious traffic, ω2 = normal traffic}; A = {α1 = server off, α2 = server on};
Λ = [ 0    λ12
      λ21  0   ]
For a bank computer network: λ12 << λ21.
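A tiny sketch of how such a loss matrix drives a minimum-risk decision; the posterior values and the cost figures below are made-up numbers for illustration only.

```python
import numpy as np

# Loss matrix Lambda[i, j] = cost of action alpha_i when the true class is omega_j.
# Classes: omega_1 = malicious traffic, omega_2 = normal traffic.
# Actions: alpha_1 = server off, alpha_2 = server on.
lam_12, lam_21 = 1.0, 50.0          # illustrative: missing an attack costs far more
Lambda = np.array([[0.0, lam_12],
                   [lam_21, 0.0]])

post = np.array([0.1, 0.9])         # assumed P(malicious|x), P(normal|x)

cond_risk = Lambda @ post           # R(alpha_i|x) = sum_j lambda(alpha_i|omega_j) P(omega_j|x)
action = np.argmin(cond_risk)       # minimum-risk action
print(cond_risk, "-> take", ["server off", "server on"][action])
```

Note how the large λ21 makes "server off" the minimum-risk action even though the posterior probability of an attack is only 0.1.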

22 An example of loss matrix for intrusion detection systems
- A 5 × 5 loss matrix over the traffic classes Normal Traffic, User to Root Attack, Remote to Local Attack, Probing Attack, and Denial of Service Attack, assigning a different cost to each (decision, true class) pair.

23 Minimum risk decision rule
- Assume that action αi is a candidate for execution after the pattern x has been observed. We do not know the true class of x, but assume we know the posteriors P(ωj|x). We can then evaluate the conditional risk associated with action αi:
R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) P(ωj|x) = E_{ω∈Ω}{λ(αi|ω) | x}
- The conditional risk can be regarded as an average loss/cost.
- Minimum risk decision rule: x → αi if R(αi|x) < R(αj|x) for all j ≠ i, i = 1, ..., a. Given the pattern x, we choose the action αi with minimum risk. This is the optimal decision rule for any pattern x.

24 Minimum risk for binary classification
- Consider a two-class problem and the special case where action = classification, so that αi corresponds to assigning the pattern to class ωi. Let λij = λ(ωi|ωj) be the loss incurred by assigning the pattern to class ωi when the true class is ωj.
- The conditional risks can be written as:
R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x)
R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x)
- The minimum risk decision rule is: x → ω1 if R(ω1|x) < R(ω2|x), else x → ω2.

25 Minimum risk for binary classification
- In terms of posterior probabilities: x → ω1 if (λ21 - λ11) P(ω1|x) > (λ12 - λ22) P(ω2|x).
- Using Bayes' rule: x → ω1 if (λ21 - λ11) p(x|ω1) P(ω1) > (λ12 - λ22) p(x|ω2) P(ω2).
- It is reasonable to assume λ21 > λ11. Making the likelihood ratio p(x|ω1)/p(x|ω2) explicit, the rule can be rewritten as: decide ω1 if the likelihood ratio exceeds a threshold θ that does not depend on x:
x → ω1 if l(x) = p(x|ω1) / p(x|ω2) > [(λ12 - λ22) / (λ21 - λ11)] · [P(ω2) / P(ω1)] = θ

26 Minimum error and the 0-1 loss matrix
- The action αi corresponds to assigning the pattern x to the class ωi. In some cases a simple loss function is appropriate:
λ(αi|ωj) = 0 if i = j, 1 if i ≠ j,   i, j = 1, ..., c
This is the 0-1 (zero-one) loss function: all errors have the same cost, equal to 1.
- With this loss, the conditional risk is exactly the conditional error probability:
R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) P(ωj|x) = Σ_{j≠i} P(ωj|x) = 1 - P(ωi|x)

27 Minimum error classification
- Using the 0-1 loss function, the minimum risk decision rule becomes the classical MAP (maximum a posteriori probability) rule: assign x to ωi if P(ωi|x) > P(ωj|x) for all j ≠ i.
- Rewriting in terms of the likelihood ratio, with θλ = [(λ12 - λ22)/(λ21 - λ11)] · P(ω2)/P(ω1): x → ω1 if p(x|ω1)/p(x|ω2) > θλ.
- Examples: with the 0-1 loss matrix Λ = [[0, 1], [1, 0]], θλ = P(ω2)/P(ω1) = θa; with, e.g., Λ = [[0, 2], [1, 0]] (λ12 = 2, λ21 = 1), θλ = 2 P(ω2)/P(ω1) = θb.

28 Minimum error classification and decision regions
- With zero diagonal losses the threshold is θλ = (λ12/λ21) · P(ω2)/P(ω1). If deciding ω1 when the true class is ω2 is more costly (λ12 > λ21), the threshold becomes tighter (higher) and the decision region R1 becomes smaller: the likelihood of ω1 must be higher before we decide ω1.
- In the figure, we obtain the threshold θb when P(ω1) = P(ω2) and λ12 = 2, λ21 = 1. The region R1 shrinks when λ12 > λ21.

29 A remark on the likelihood ratio test and decision regions
- The likelihood ratio test transforms the decision problem in a d-dimensional feature space into a one-dimensional test against the threshold θ, without any need to know the decision regions exactly and explicitly.
- The decision regions could be very complex manifolds, but we do not need to compute them to classify the pattern x: it is sufficient to compute the likelihood ratio l(x) and compare it with the threshold θ.

30 Some remarks on the use of the minimum risk decision rule in security problems
- Ω = {ω1 = malicious traffic, ω2 = normal traffic}; A = {α1 = server off, α2 = server on}; loss matrix:
Λ = [ 0    λ12
      λ21  0   ]
- Minimum risk decision rule: server off if l(x) = p(x|attack) / p(x|normal) > (λ12/λ21) · P(normal)/P(attack) = θ. We assume λ12 << λ21.
- How should we set the costs? How should we estimate the priors?

31 Some remarks on the use of the minimum risk decision rule in security problems
- How should we set the costs? How should we estimate the priors?
- Rewrite the decision rule with P* = P(attack), P(normal) = 1 - P*, and λ* = λ12/λ21:
l(x) = p(x|attack) / p(x|normal) > λ* (1 - P*) / P* = θ
- In this form the relationship between priors and costs/losses becomes much clearer: an estimate of P* helps in choosing the costs, since the value of P* directly constrains how small λ12 must be with respect to λ21 (λ12 << λ21) if we want the classifier to actually detect attacks.
- What if P* cannot be estimated reliably? We will see the minimax rule later.

32 Overall risk minimization
- The action taken depends on the pattern x through the decision function α(x), and the overall risk R can be written as:
R = ∫ R(α(x)|x) p(x) dx = Σ_{i=1}^{a} ∫_{Ri} Σ_{j=1}^{c} λ(αi|ωj) p(x|ωj) P(ωj) dx
- We can show that the minimum risk rule, applied to every pattern x, minimizes the overall risk over the entire feature space.
- Along the way we introduce some notable concepts (false alarm and miss alarm probability/rate) and formulate a two-class problem as a hypothesis testing problem.
- To analyse overall risk minimization, let us consider the personal identity verification problem based on fingerprint recognition.

33 Fingerprint recognition as hypothesis testing
- A person claims an identity ("I'm Fabio Roli!"); the system performs a two-class/hypothesis test: genuine vs impostor.
- We use d features, called minutiae, extracted from the fingerprint image. The minutiae points of two fingerprints (the input and the template stored in the fingerprint database) are compared (minutiae verification), producing a verification score.
- The decision is based on the score, which takes real values between 0 and 1.

34 Decision regions
- The figure shows an example of the sample distributions of the scores of genuine and impostor users; the decision regions R1 and R2 are defined according to these distributions.
- Let us assume that p(s|genuine) and p(s|impostor) are known, and that the space S of score values has been subdivided into regions R1 and R2 so that: if s belongs to R2, the claimed identity is verified (genuine user); else (s ∈ R1), the claimed identity is rejected (impostor user).

35 False and miss alarm rates
- The optimal decision corresponds to the regions R1 and R2 that minimize the expected overall risk:
R = E{risk} = λ11 P(impostor|impostor) P(impostor) + λ12 P(impostor|genuine) P(genuine) + λ21 P(genuine|impostor) P(impostor) + λ22 P(genuine|genuine) P(genuine)
where the decisions "impostor" and "genuine" correspond to s ∈ R1 and s ∈ R2, respectively, and:
P(impostor|impostor) = ∫_{R1} p(s|impostor) ds = 1 - P_MA,   P(impostor|genuine) = ∫_{R1} p(s|genuine) ds = P_FA
P(genuine|impostor) = ∫_{R2} p(s|impostor) ds = P_MA,   P(genuine|genuine) = ∫_{R2} p(s|genuine) ds = 1 - P_FA
- Therefore: R = λ11 (1 - P_MA) P_imp + λ12 P_FA P_gen + λ21 P_MA P_imp + λ22 (1 - P_FA) P_gen
- P_FA: false alarm rate (or false reject rate, FRR); P_MA: miss alarm rate (or false acceptance rate, FAR); P_gen = P(genuine); P_imp = P(impostor).
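A small sketch of how these quantities are estimated empirically from verification scores; the simulated genuine/impostor score distributions, the priors and the costs are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated verification scores in [0, 1] (assumed Beta-shaped distributions).
genuine_scores = rng.beta(8, 2, size=5000)    # genuine users tend to score high
impostor_scores = rng.beta(2, 8, size=5000)   # impostors tend to score low

P_gen, P_imp = 0.5, 0.5                       # assumed priors
lam11, lam12, lam21, lam22 = 0.0, 1.0, 10.0, 0.0   # illustrative costs

def rates_and_risk(t):
    """R1 = {s < t} (decide impostor), R2 = {s >= t} (decide genuine)."""
    P_FA = np.mean(genuine_scores < t)        # false alarm / false reject rate
    P_MA = np.mean(impostor_scores >= t)      # miss alarm / false acceptance rate
    R = (lam11 * (1 - P_MA) * P_imp + lam12 * P_FA * P_gen +
         lam21 * P_MA * P_imp + lam22 * (1 - P_FA) * P_gen)
    return P_FA, P_MA, R

for t in (0.3, 0.5, 0.7):
    print("t=%.1f  P_FA=%.3f  P_MA=%.3f  risk=%.3f" % ((t,) + rates_and_risk(t)))
```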

36 Overall risk minimization
- Rewriting the expected overall risk:
R = λ11 ∫_{R1} p(s|impostor) ds P_imp + λ12 ∫_{R1} p(s|genuine) ds P_gen + λ21 ∫_{R2} p(s|impostor) ds P_imp + λ22 ∫_{R2} p(s|genuine) ds P_gen
- Since ∫_{R2} p(s|impostor) ds = 1 - ∫_{R1} p(s|impostor) ds and ∫_{R2} p(s|genuine) ds = 1 - ∫_{R1} p(s|genuine) ds, we obtain:
R = λ21 P_imp + λ22 P_gen + ∫_{R1} [ P_imp (λ11 - λ21) p(s|impostor) + P_gen (λ12 - λ22) p(s|genuine) ] ds
- The integrand should be negative over R1 in order to minimize the risk!

37 Overall risk minimization
- A point s should therefore be assigned to R1 when the integrand is negative:
P_imp (λ11 - λ21) p(s|impostor) + P_gen (λ12 - λ22) p(s|genuine) < 0
P_imp (λ21 - λ11) p(s|impostor) > P_gen (λ12 - λ22) p(s|genuine)
p(s|impostor) / p(s|genuine) > [P_gen (λ12 - λ22)] / [P_imp (λ21 - λ11)]
- This is exactly the minimum risk decision rule applied to each pattern s, which proves that the minimum risk decision rule, used for every pattern s, minimizes the expected overall risk.

38 Minimax decision rule
- In some real cases the priors can change over time, or their estimation can be difficult (intrusion detection, spam filtering, etc.). We need a decision rule that works even if the priors are not known.
- An approach used in many engineering problems is worst-case design: we design the classifier to minimize the risk in the worst case over the possible prior values.
- Minimax: minimize the maximum risk. This is worst-case design; it is very conservative/pessimistic, so performance is not optimal in general, only for the worst case.

39 Minimax decision rule
- Consider a two-class problem and two regions R1 and R2 (not known at the beginning). The overall risk is:
R = ∫_{R1} [λ11 P1 p(x|ω1) + λ12 P2 p(x|ω2)] dx + ∫_{R2} [λ21 P1 p(x|ω1) + λ22 P2 p(x|ω2)] dx
- Using P2 = 1 - P1 and ∫_{R1} p(x|ω1) dx = 1 - ∫_{R2} p(x|ω1) dx, we express R as a function of P1 and simplify the equation (next slide).

40 Minimax decision rule
- Expressing the risk as a function of P1:
R(P1) = λ22 + (λ12 - λ22) ∫_{R1} p(x|ω2) dx + P1 [ (λ11 - λ22) + (λ21 - λ11) ∫_{R2} p(x|ω1) dx - (λ12 - λ22) ∫_{R1} p(x|ω2) dx ]
- Note that the costs and priors identify the threshold θ of the likelihood ratio test:
l(x) = p(x|ω1) / p(x|ω2) ≷ [(λ12 - λ22)/(λ21 - λ11)] · P(ω2)/P(ω1) = θ,   R1 = {x : l(x) > θ}, R2 = {x : l(x) < θ}
The threshold and the density functions define the regions R1 and R2, and hence the overall risk.
- Varying P1 changes the threshold θ(P1), the decision regions, and the overall risk. This does not allow us to control the risk; this is the key point!
- Key issue: P1 can change! We want to estimate the risk we could incur, that is, evaluate it independently of variations of P1.

41 Minimax decision rule
- The above equation shows that, once the decision regions R1 and R2 have been fixed, the overall risk is a linear function of P1. If R1 and R2 make the term in square brackets equal to zero, then the overall risk does not depend on the priors! This is the key point.
- This is the minimax solution, and the minimax risk R_mm is:
R_mm = λ22 + (λ12 - λ22) ∫_{R1} p(x|ω2) dx = λ11 + (λ21 - λ11) ∫_{R2} p(x|ω1) dx
(check this equality as an exercise!)

42 Minimax: risk as a function of P(ω1)
- From the formula of R(P1) it is easy to see that: P1 = 0 implies that region R1 is empty, therefore R = λ22; P1 = 1 implies that region R2 is empty, therefore R = λ11.

43 Minimax: the risk as a linear function
- Assume that P(ω1) = P1* = 0.6, so that the risk associated with the decision regions R1(P1*) and R2(P1*) is R(P1*). If P1 then changes over time while the decision regions stay those identified by P1* = 0.6, the above equation shows that the risk varies as a linear function of P1.

44 Minimax: linear risk function
- To control the risk we choose P1* so that the corresponding linear risk function (the red line in the figure) has slope 0. The related regions R1(P1*) and R2(P1*) yield R_max.
- We are thus minimizing the maximum risk we can incur when the priors change (minimax): any other choice of regions can give a higher risk when the prior changes away from its design value P1*.

45 Minimax
- To identify the regions R1(θ) and R2(θ) associated with the minimax line, we impose:
(λ11 - λ22) + (λ21 - λ11) ∫_{R2} p(x|ω1) dx - (λ12 - λ22) ∫_{R1} p(x|ω2) dx = 0
- We should identify the regions that satisfy the above equation. This can be done in closed form only in a few cases; it is very difficult in most real cases.

46 Minimax straight line
(λ11 - λ22) + (λ21 - λ11) ∫_{R2} p(x|ω1) dx - (λ12 - λ22) ∫_{R1} p(x|ω2) dx = 0
- The above integrals are the error probabilities associated with the two classes (the so-called P_FA and P_MA); it is easy to see that these errors can be controlled by θ. We can therefore look for a threshold θ* that satisfies the above equation; this threshold identifies R1(θ*) = {x : l(x) > θ*} and R2(θ*) = {x : l(x) < θ*}.
- Note that θ* is not given by the usual expression [(λ12 - λ22)/(λ21 - λ11)] · P(ω2)/P(ω1): we are looking for the optimal threshold θ* without relying on the priors.

47 Minimax line with 0-1 costs
- For the 0-1 loss matrix the minimax condition becomes:
∫_{R2} p(x|ω1) dx - ∫_{R1} p(x|ω2) dx = 0, i.e., ∫_{R2} p(x|ω1) dx = ∫_{R1} p(x|ω2) dx
- The threshold θ* is the one that makes the two error probabilities equal (P_FA = P_MA). In fingerprint recognition it is called the EER (Equal Error Rate) threshold (see the figure with the impostor and genuine score distributions).
- Note that the threshold θ* that makes P_FA = P_MA is not the optimal Bayesian threshold: we are using the minimax rule, which is not the ideal minimum risk rule.

48 Minimax line: general case
(λ11 - λ22) + (λ21 - λ11) ∫_{R2} p(x|ω1) dx - (λ12 - λ22) ∫_{R1} p(x|ω2) dx = 0
- In general, the empirical computation of the minimax line requires a classifier that allows control of the regions R1 and R2. This is not always doable.
- If the loss matrix has 0-1 values, we can change the parameters of the classifier so that ∫_{R2} p(x|ω1) dx = ∫_{R1} p(x|ω2) dx, that is, we tune the parameters to make the two error probabilities equal (see the sketch below).
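A minimal numerical sketch of this tuning for a single-feature problem with 0-1 losses: scan the decision threshold until the two class-conditional error probabilities match. The Gaussian densities are the same illustrative assumptions used in the earlier examples.

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities (assumptions); class 1 lies to the left.
p1, p2 = norm(4.0, 1.0), norm(7.0, 1.5)

# With a single decision point t: R1 = {x < t}, R2 = {x >= t}.
# Error on class 1: P{x in R2 | w1} = 1 - F1(t); error on class 2: P{x in R1 | w2} = F2(t).
def error_gap(t):
    return (1 - p1.cdf(t)) - p2.cdf(t)

# Scan thresholds and pick the one where the two errors are (nearly) equal.
ts = np.linspace(0, 12, 12001)
t_star = ts[np.argmin(np.abs([error_gap(t) for t in ts]))]

print("minimax threshold t* = %.3f" % t_star)
print("P{error|w1} = %.4f   P{error|w2} = %.4f" % (1 - p1.cdf(t_star), p2.cdf(t_star)))
```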

49 Neyman-Pearson decision rule
- If we know neither the priors nor the costs, we can use the Neyman-Pearson decision rule. This rule is used in applications where we have a constraint on the false alarm rate and we want to minimize the miss alarm rate (e.g., radar applications or biometric recognition).
- We fix a given false alarm rate, P_FA = α. The Neyman-Pearson rule minimizes P_MA subject to P_FA = α. Using a Lagrange multiplier λ, the minimization problem is formulated as:
F = P_MA + λ (P_FA - α) = ∫_{R2} p(x|ω1) dx + λ [ ∫_{R1} p(x|ω2) dx - α ]
  = ∫_{R2} p(x|ω1) dx + λ [ 1 - ∫_{R2} p(x|ω2) dx - α ]
  = λ (1 - α) + ∫_{R2} [ p(x|ω1) - λ p(x|ω2) ] dx

50 Neyman-Pearson rule for a two-class problem
- Problem: identify the region R2 that solves the constrained minimization above. Disregarding the constant term λ(1 - α), the problem becomes:
min_{R2} ∫_{R2} [ p(x|ω1) - λ p(x|ω2) ] dx   subject to   P_FA = ∫_{R1} p(x|ω2) dx = α
- The integral is minimized when R2 contains exactly the points where the integrand is negative, so R2 = {x ∈ R : p(x|ω1) < λ p(x|ω2)} = {x ∈ R : l(x) < λ}.
- Therefore the decision rule is: l(x) = p(x|ω1)/p(x|ω2) ≷ λ, deciding ω1 if l(x) > λ and ω2 if l(x) < λ.

51 Neyman-Pearson rule: threshold computation
- The Neyman-Pearson rule is based on a likelihood ratio test, whose threshold comes from the constraint. Since P_FA = P_FA(λ), the constraint P_FA = α identifies the value of λ. More precisely, treating l = l(x) as a random variable (a function of x), we have:
P_FA = P{l(x) > λ | ω2} = ∫_{λ}^{+∞} p_l(l|ω2) dl,  and λ = λ* is chosen so that ∫_{λ*}^{+∞} p_l(l|ω2) dl = α
- In practice, the constraint P_FA ≤ α is met by experiments.
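A sketch of how λ* can be found empirically when samples of the ω2 (noise/impostor) class are available; the simulated data and the target α below are assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Illustrative densities: omega_1 = "signal"/target class, omega_2 = "noise" class.
p1, p2 = norm(4.0, 1.0), norm(7.0, 1.5)

# Likelihood ratio evaluated on samples drawn from omega_2 only.
x2 = rng.normal(7.0, 1.5, 20000)
l_omega2 = p1.pdf(x2) / p2.pdf(x2)

alpha = 0.05                                   # required false alarm rate P_FA
# lambda* = (1 - alpha)-quantile of l(x) under omega_2, so that P{l > lambda* | omega_2} ~= alpha.
lam_star = np.quantile(l_omega2, 1 - alpha)

P_FA_hat = np.mean(l_omega2 > lam_star)
print("lambda* = %.4g   empirical P_FA = %.4f (target alpha = %.2f)" % (lam_star, P_FA_hat, alpha))
```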

52 Neyman-Pearson: biometric example
- The figure shows the genuine and impostor score distributions. In this case it is easy to find, by experiments, the threshold value that provides P_FA ≤ α.

53 Neyman-Pearson rule and the ROC curve
- With the Neyman-Pearson rule, varying the threshold λ yields different values of P_FA and of the probability of detection P_D (P_D = 1 - P_MA). To assess performance over varying threshold values one can use the ROC (Receiver Operating Characteristic) curve, which plots P_D as a function of P_FA for the different thresholds.
- The ROC curve allows one to analyse the trade-off between the false and miss alarm rates; a good ROC curve has small P_FA and large P_D; the ideal ROC curve is a rectangular (step-like) curve.
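A compact sketch of how an empirical ROC curve is built from scores of the two classes; the simulated scores are, as before, an assumption made only for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated detection scores (illustrative): omega_1 = target class, omega_2 = non-target.
scores_w1 = rng.normal(2.0, 1.0, 5000)   # scores under omega_1
scores_w2 = rng.normal(0.0, 1.0, 5000)   # scores under omega_2

# Sweep the threshold: decide omega_1 when score > t.
thresholds = np.linspace(-4, 6, 201)
P_FA = np.array([np.mean(scores_w2 > t) for t in thresholds])   # false alarm rate
P_D  = np.array([np.mean(scores_w1 > t) for t in thresholds])   # detection rate = 1 - P_MA

# Each (P_FA[i], P_D[i]) pair is one point of the ROC curve.
for i in range(0, len(thresholds), 50):
    print("t=%5.2f  P_FA=%.3f  P_D=%.3f" % (thresholds[i], P_FA[i], P_D[i]))
```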

54 Decision with reject option
- Even if one were able to achieve the minimum Bayes error (threshold xB in the figure), this error rate could still be unacceptable for a given application. Example: screening for medical diagnosis, where one could demand a false negative rate equal to zero.
- This happens when errors are very costly, so we must protect against them by limiting the error rate below a given threshold; an obvious way to limit decision errors is not to make a decision, or to postpone it.
- To reduce the error probability one can therefore omit or defer decisions (reject option). Omitting decisions is a rational and feasible option provided that the decision can be taken in some other way (e.g., by a human).

55 Classification with reject option
- A classifier with reject option takes x ∈ R^d as input and outputs either a class (x → ωi) or a rejection.
- The reject option requires an additional decision with respect to the standard formulation of the classification problem: set of classes Ω = {ω1, ω2, ..., ωc}; set of actions/decisions A = {α0, α1, α2, ..., αa}. If the action is a classification: A = {ω0, ω1, ω2, ..., ωc}.
- We have an additional class ω0: the class containing the rejected samples.

56 Loss matrix and minimum risk with reject option
- The loss matrix Λ, of size (c+1) × c, is:
Λ = [ λ(ω0|ω1) λ(ω0|ω2) ... λ(ω0|ωc)
      λ(ω1|ω1) λ(ω1|ω2) ... λ(ω1|ωc)
      ...
      λ(ωc|ω1) λ(ωc|ω2) ... λ(ωc|ωc) ]
- The minimum risk decision criterion is: x → ωi if R(ωi|x) < R(ωj|x) for all j ≠ i, i = 0, 1, ..., c.
- The main difference with respect to the case without reject option is that the minimum risk decision can be a rejection, when R(ω0|x) < R(ωj|x) for all j ≠ 0.

57 Binary classification with equal costs
- Consider binary classification with equal costs:
Λ = [ λ(ω0|ω1) λ(ω0|ω2)     [ λr  λr
      λ(ω1|ω1) λ(ω1|ω2)  =    λc  λe
      λ(ω2|ω1) λ(ω2|ω2) ]     λe  λc ]
Reject cost = λr, error cost = λe, cost of correct classification = λc (usually λc = 0).
- According to the minimum risk criterion we have three decision regions:
R0 = {x ∈ R : R(ω0|x) < R(ωj|x), j ≠ 0}
R1 = {x ∈ R : R(ω1|x) < R(ωj|x), j ≠ 1}
R2 = {x ∈ R : R(ω2|x) < R(ωj|x), j ≠ 2}

58 Binary classification with equal costs
- The overall risk R can be written as:
R = λr [P(reject, ω1) + P(reject, ω2)] + λe [P(error, ω1) + P(error, ω2)] + λc [P(correct, ω1) + P(correct, ω2)]
  = λr [ P(ω1) ∫_{R0} p(x|ω1) dx + P(ω2) ∫_{R0} p(x|ω2) dx ]
  + λe [ P(ω1) ∫_{R2} p(x|ω1) dx + P(ω2) ∫_{R1} p(x|ω2) dx ]
  + λc [ P(ω1) ∫_{R1} p(x|ω1) dx + P(ω2) ∫_{R2} p(x|ω2) dx ]

59 The error-reject trade-off
- From the previous equation we obtain: R = λr P(reject) + λe P(error) + λc P(correct). Since P(reject) + P(error) + P(correct) = 1, we have:
R = λc + (λr - λc) P(reject) + (λe - λc) P(error)
- This formulation of the overall risk clearly shows that a given value of the risk can be obtained by trading off error probability against reject probability: the error-reject trade-off.
- The trade-off is also shown by the relation between the three probabilities: P(correct) = 1 - P(reject) - P(error).
- We can reduce the error probability by increasing the reject probability, and vice versa.

60 Error-reject trade-off and Chow's rule
- Requirements of practical applications are often given as: minimize the error probability with P(reject) lower than r (e.g., minimize the error with reject < 15%). These requirements can be satisfied by Chow's rule (C.K. Chow, 1957, 1970), which is the optimal rule with reject option:
if max_i P(ωi|x) ≥ T, assign x to the class ωi with maximum posterior; otherwise reject x, with T = (λe - λr) / (λe - λc)
- T is the reject threshold; T ∈ [0, 1] because λc ≤ λr ≤ λe. For T = 0 (λe = λr) we obtain the classical MAP rule.
- It can be shown (C.K. Chow, 1957) that Chow's rule minimizes the error probability (i.e., maximizes classification accuracy) for any value of the reject probability. Chow's rule reduces the error by rejecting the patterns for which the classification is not reliable enough (see the sketch below).
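A minimal sketch of Chow's rule on top of posterior estimates; the cost values and the example posteriors below are illustrative assumptions.

```python
import numpy as np

lam_c, lam_r, lam_e = 0.0, 0.2, 1.0          # assumed costs: correct, reject, error
T = (lam_e - lam_r) / (lam_e - lam_c)        # Chow's reject threshold (here T = 0.8)

def chow_decision(posteriors):
    """Return the predicted class index, or None to reject (Chow's rule)."""
    posteriors = np.asarray(posteriors)
    i = int(np.argmax(posteriors))
    return i if posteriors[i] >= T else None

# Example posteriors for three patterns (assumed values).
for post in ([0.95, 0.03, 0.02], [0.55, 0.40, 0.05], [0.40, 0.35, 0.25]):
    print(post, "->", chow_decision(post))
```

Lowering λr lowers T and makes the classifier reject less often, which is exactly the error-reject trade-off discussed above.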

61 Example of use of Chow's rule
- The figure shows P(ω1|x) and P(ω2|x) for two classes with Gaussian distributions, the reject threshold T, and the resulting regions R1, R0, R2: the reject threshold T identifies the reject region R0.
- The example clearly shows that the error probability can be reduced by increasing the reject threshold T; the error becomes zero when the region R0 contains all the patterns that would be misclassified.

62 Examples of error-reject trade-off
- Hypothetical error-reject trade-off curve: as T increases, the rejection rate increases and the error decreases (error-reject trade-off).
- The figure also shows examples of accuracy-rejection curves for different OCR (Optical Character Recognition) algorithms.

63 Discriminant functions, decision surfaces/regions
- So far we have represented a classifier as a machine that takes the pattern x as input and assigns it to a class according to minimum risk theory: x → ωi if R(ωi|x) < R(ωj|x) for all j ≠ i, i = 1, ..., c.
- An alternative representation of a pattern classifier is in terms of a set of discriminant functions gi(x), i = 1, ..., c: we assign x to the class ωi if gi(x) > gj(x) for all j ≠ i.

64 Discriminant functions, decision surfaces/regions
- In general we can take gi(x) = -R(ωi|x): the discriminant function aims to minimize the risk. If we want to minimize the error probability: gi(x) = P(ωi|x).
- There are many possible choices for gi(x): we can replace gi(x) with f(gi(x)), where f(·) is a monotonically increasing function. In particular, if we want to minimize the error probability, all the following choices are equivalent:
gi(x) = P(ωi|x) = p(x|ωi) P(ωi) / Σ_{j=1}^{c} p(x|ωj) P(ωj)
gi(x) = p(x|ωi) P(ωi)
gi(x) = ln p(x|ωi) + ln P(ωi)
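A small sketch of the last form, gi(x) = ln p(x|ωi) + ln P(ωi), for a three-class problem with one-dimensional Gaussian class-conditionals; all parameters are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Illustrative 1-D Gaussian class-conditionals and priors for three classes.
classes = [norm(0.0, 1.0), norm(3.0, 1.0), norm(6.0, 2.0)]
priors = np.array([0.5, 0.3, 0.2])

def discriminants(x):
    """g_i(x) = ln p(x|omega_i) + ln P(omega_i): a monotone transform of the posterior."""
    return np.array([c.logpdf(x) + np.log(p) for c, p in zip(classes, priors)])

for x in (0.5, 2.8, 7.0):
    g = discriminants(x)
    print("x=%.1f  g=%s  -> omega_%d" % (x, np.round(g, 2), np.argmax(g) + 1))
```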

65 Discriminant functions, decision surfaces/regions
- The discriminant functions subdivide the feature space into c decision regions R1, ..., Rc. If gi(x) > gj(x) for every j ≠ i, then x ∈ Ri and it is assigned to the class ωi.
- The decision boundaries between regions are specified by gi(x) = gj(x) for the two discriminant functions with the largest values.
- Two-dimensional example in the figure: two classes with Gaussian distributions, quadratic decision surfaces; region R2 is not simply connected.

66 Discriminant functions for binary classification
- For a two-class problem we can use a single discriminant function g(x) = g1(x) - g2(x): if g(x) > 0, decide ω1, otherwise ω2.
- The following forms of the discriminant function can be used for a two-class problem:
g(x) = P(ω1|x) - P(ω2|x)
g(x) = ln [p(x|ω1) / p(x|ω2)] + ln [P(ω1) / P(ω2)]
- A pattern classifier based on one of the above discriminant functions for a two-class task is called a dichotomizer.
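A sketch of the log-ratio dichotomizer, reusing the illustrative two-class Gaussian setting assumed in the earlier examples:

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities and priors (assumptions, as before).
p1, p2 = norm(4.0, 1.0), norm(7.0, 1.5)
P1, P2 = 2/3, 1/3

def g(x):
    """Dichotomizer g(x) = ln[p(x|w1)/p(x|w2)] + ln[P(w1)/P(w2)]; decide w1 iff g(x) > 0."""
    return (p1.logpdf(x) - p2.logpdf(x)) + np.log(P1 / P2)

for x in (4.5, 6.0, 8.0):
    print("x=%.1f  g(x)=%+.3f  ->" % (x, g(x)), "omega_1" if g(x) > 0 else "omega_2")
```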

67 References
- Sections 2.1-2.6, Pattern Classification, R. O. Duda, P. E. Hart, and D. G. Stork, John Wiley & Sons, 2000.
- Chapter 1, Statistical Pattern Recognition, Andrew Webb, John Wiley & Sons, 2002.
- C. K. Chow, "On optimum recognition error and reject tradeoff", IEEE Transactions on Information Theory, vol. 16, no. 1, pp. 41-46, 1970.


More information

Pattern Recognition Problem. Pattern Recognition Problems. Pattern Recognition Problems. Pattern Recognition: OCR. Pattern Recognition Books

Pattern Recognition Problem. Pattern Recognition Problems. Pattern Recognition Problems. Pattern Recognition: OCR. Pattern Recognition Books Introduction to Statistical Pattern Recognition Pattern Recognition Problem R.P.W. Duin Pattern Recognition Group Delft University of Technology The Netherlands prcourse@prtools.org What is this? What

More information

Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak

Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak 1 Introduction. Random variables During the course we are interested in reasoning about considered phenomenon. In other words,

More information

Advanced statistical methods for data analysis Lecture 1

Advanced statistical methods for data analysis Lecture 1 Advanced statistical methods for data analysis Lecture 1 RHUL Physics www.pp.rhul.ac.uk/~cowan Universität Mainz Klausurtagung des GK Eichtheorien exp. Tests... Bullay/Mosel 15 17 September, 2008 1 Outline

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

DETECTION theory deals primarily with techniques for

DETECTION theory deals primarily with techniques for ADVANCED SIGNAL PROCESSING SE Optimum Detection of Deterministic and Random Signals Stefan Tertinek Graz University of Technology turtle@sbox.tugraz.at Abstract This paper introduces various methods for

More information

Introduction to Machine Learning

Introduction to Machine Learning Outline Introduction to Machine Learning Bayesian Classification Varun Chandola March 8, 017 1. {circular,large,light,smooth,thick}, malignant. {circular,large,light,irregular,thick}, malignant 3. {oval,large,dark,smooth,thin},

More information

Machine Learning Lecture 2

Machine Learning Lecture 2 Machine Perceptual Learning and Sensory Summer Augmented 6 Computing Announcements Machine Learning Lecture 2 Course webpage http://www.vision.rwth-aachen.de/teaching/ Slides will be made available on

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012

Classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2012 Classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2012 Topics Discriminant functions Logistic regression Perceptron Generative models Generative vs. discriminative

More information

Lecture 8: Information Theory and Statistics

Lecture 8: Information Theory and Statistics Lecture 8: Information Theory and Statistics Part II: Hypothesis Testing and Estimation I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 22, 2015

More information

INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS. Michael A. Lexa and Don H. Johnson

INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS. Michael A. Lexa and Don H. Johnson INFORMATION PROCESSING ABILITY OF BINARY DETECTORS AND BLOCK DECODERS Michael A. Lexa and Don H. Johnson Rice University Department of Electrical and Computer Engineering Houston, TX 775-892 amlexa@rice.edu,

More information

outline Nonlinear transformation Error measures Noisy targets Preambles to the theory

outline Nonlinear transformation Error measures Noisy targets Preambles to the theory Error and Noise outline Nonlinear transformation Error measures Noisy targets Preambles to the theory Linear is limited Data Hypothesis Linear in what? Linear regression implements Linear classification

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

Mixtures of Gaussians with Sparse Structure

Mixtures of Gaussians with Sparse Structure Mixtures of Gaussians with Sparse Structure Costas Boulis 1 Abstract When fitting a mixture of Gaussians to training data there are usually two choices for the type of Gaussians used. Either diagonal or

More information

Bayes Rule for Minimizing Risk

Bayes Rule for Minimizing Risk Bayes Rule for Minimizing Risk Dennis Lee April 1, 014 Introduction In class we discussed Bayes rule for minimizing the probability of error. Our goal is to generalize this rule to minimize risk instead

More information

Introduction to Detection Theory

Introduction to Detection Theory Introduction to Detection Theory Detection Theory (a.k.a. decision theory or hypothesis testing) is concerned with situations where we need to make a decision on whether an event (out of M possible events)

More information

Linear Classifiers as Pattern Detectors

Linear Classifiers as Pattern Detectors Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2014/2015 Lesson 16 8 April 2015 Contents Linear Classifiers as Pattern Detectors Notation...2 Linear

More information

Optimum Joint Detection and Estimation

Optimum Joint Detection and Estimation 20 IEEE International Symposium on Information Theory Proceedings Optimum Joint Detection and Estimation George V. Moustakides Department of Electrical and Computer Engineering University of Patras, 26500

More information

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)?

2. What are the tradeoffs among different measures of error (e.g. probability of false alarm, probability of miss, etc.)? ECE 830 / CS 76 Spring 06 Instructors: R. Willett & R. Nowak Lecture 3: Likelihood ratio tests, Neyman-Pearson detectors, ROC curves, and sufficient statistics Executive summary In the last lecture we

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Object Recognition. Digital Image Processing. Object Recognition. Introduction. Patterns and pattern classes. Pattern vectors (cont.

Object Recognition. Digital Image Processing. Object Recognition. Introduction. Patterns and pattern classes. Pattern vectors (cont. 3/5/03 Digital Image Processing Obect Recognition Obect Recognition One of the most interesting aspects of the world is that it can be considered to be made up of patterns. Christophoros Nikou cnikou@cs.uoi.gr

More information

Classifier performance evaluation

Classifier performance evaluation Classifier performance evaluation Václav Hlaváč Czech Technical University in Prague Czech Institute of Informatics, Robotics and Cybernetics 166 36 Prague 6, Jugoslávských partyzánu 1580/3, Czech Republic

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

Ch 4. Linear Models for Classification

Ch 4. Linear Models for Classification Ch 4. Linear Models for Classification Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Department of Computer Science and Engineering Pohang University of Science and echnology 77 Cheongam-ro,

More information

PATTERN CLASSIFICATION

PATTERN CLASSIFICATION PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

More information

The Naïve Bayes Classifier. Machine Learning Fall 2017

The Naïve Bayes Classifier. Machine Learning Fall 2017 The Naïve Bayes Classifier Machine Learning Fall 2017 1 Today s lecture The naïve Bayes Classifier Learning the naïve Bayes Classifier Practical concerns 2 Today s lecture The naïve Bayes Classifier Learning

More information

Statistical and Learning Techniques in Computer Vision Lecture 1: Random Variables Jens Rittscher and Chuck Stewart

Statistical and Learning Techniques in Computer Vision Lecture 1: Random Variables Jens Rittscher and Chuck Stewart Statistical and Learning Techniques in Computer Vision Lecture 1: Random Variables Jens Rittscher and Chuck Stewart 1 Motivation Imaging is a stochastic process: If we take all the different sources of

More information

Necessary Corrections in Intransitive Likelihood-Ratio Classifiers

Necessary Corrections in Intransitive Likelihood-Ratio Classifiers Necessary Corrections in Intransitive Likelihood-Ratio Classifiers Gang Ji and Jeff Bilmes SSLI-Lab, Department of Electrical Engineering University of Washington Seattle, WA 9895-500 {gang,bilmes}@ee.washington.edu

More information

Linear Discrimination Functions

Linear Discrimination Functions Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach

More information

Pattern Recognition. Parameter Estimation of Probability Density Functions

Pattern Recognition. Parameter Estimation of Probability Density Functions Pattern Recognition Parameter Estimation of Probability Density Functions Classification Problem (Review) The classification problem is to assign an arbitrary feature vector x F to one of c classes. The

More information