Introduction to Pattern Recognition [ Part 4 ] Mahdi Vasighi
Remarks It is quite common to assume that the data in each class are adequately described by a Gaussian distribution. Under this assumption the Bayes classifier is either linear (when the classes share a common covariance matrix) or quadratic (when the covariance matrices differ). In statistics, this approach to the classification task is known as linear discriminant analysis (LDA) or quadratic discriminant analysis (QDA).
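As a minimal sketch (not from the slides), a one-dimensional Gaussian Bayes classifier illustrates this remark: the discriminant g_i(x) = ln p(x|ω_i) + ln P(ω_i) is quadratic in x, and with equal variances and equal priors the boundary collapses to the linear (LDA-like) midpoint rule. All parameter values below are illustrative.

```python
import math

def gaussian_discriminant(x, mu, var, prior):
    # g_i(x) = ln p(x|w_i) + ln P(w_i) for a 1-D Gaussian class-conditional density
    return -0.5 * (x - mu) ** 2 / var - 0.5 * math.log(2 * math.pi * var) + math.log(prior)

def classify(x, params):
    # pick the class with the largest discriminant value
    return max(params, key=lambda c: gaussian_discriminant(x, *params[c]))

# illustrative parameters: equal variances -> linear (LDA-like) boundary at x = 1
params = {"w1": (0.0, 1.0, 0.5), "w2": (2.0, 1.0, 0.5)}
print(classify(0.4, params))  # w1 (closer to mu1)
print(classify(1.6, params))  # w2 (closer to mu2)
```

With unequal variances the same discriminants give a quadratic (QDA) boundary; only the parameters in `params` change.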
Error probabilities We can obtain additional insight into the operation of a general classifier by considering the sources of its error. Suppose the dichotomizer has divided the feature space into two regions R1 and R2, possibly in a non-optimal way. Then
P(error) = P(x in R2, ω1) + P(x in R1, ω2) = ∫_R2 p(x|ω1)P(ω1) dx + ∫_R1 p(x|ω2)P(ω2) dx
For multicategory cases it is simpler to work with the probability of being correct:
P(correct) = Σ_i ∫_Ri p(x|ωi)P(ωi) dx
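The two-category error integral can be checked numerically; here is a sketch under illustrative Gaussian assumptions (means 0 and 2, unit variance, equal priors), where the minimum achievable error is obtained by integrating min(p(x|ω1)P(ω1), p(x|ω2)P(ω2)):

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def bayes_error(mu1, mu2, var, p1, p2, lo=-10.0, hi=12.0, n=100000):
    # P(error) = integral of min(p(x|w1)P(w1), p(x|w2)P(w2)) dx  (midpoint Riemann sum)
    dx = (hi - lo) / n
    return sum(min(p1 * normal_pdf(lo + (i + 0.5) * dx, mu1, var),
                   p2 * normal_pdf(lo + (i + 0.5) * dx, mu2, var))
               for i in range(n)) * dx

err = bayes_error(0.0, 2.0, 1.0, 0.5, 0.5)
print(round(err, 4))  # close to Phi(-1) ~ 0.1587
```

For these symmetric Gaussians the optimal boundary is the midpoint x = 1, and the numeric result matches the closed-form value Φ(−1).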
ROC: Receiver Operating Characteristic Originated in electronic signal detection theory (1940s-1950s). A simple example: a person is presented with a stimulus and must decide whether the signal is there or not. What makes this situation confusing and difficult is the presence of noise that resembles the signal. ROC analysis has become very popular in biomedical applications, and is also used in machine learning to assess classifiers.
Suppose we are interested in a cancer test using a measured blood parameter such as a protein level (x). o x has mean µ2 for cancerous samples o x has mean µ1 for healthy samples o the actual value is a random variable o the classifier employs a threshold value x* for the decision o the outcomes are labeled either as positive or negative
P(x > x* | ω2): the probability that the protein level (x) is above x* given that the sample is cancerous. True positive (TP) (Hit) P(x > x* | ω1): the probability that the protein level (x) is above x* although the sample is healthy. False positive (FP) (False alarm)
P(x < x* | ω2): the probability that the protein level (x) is below x* given that the sample is cancerous. False negative (FN) (Miss) P(x < x* | ω1): the probability that the protein level (x) is below x* given that the sample is healthy. True negative (TN) (Correct rejection)
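These four probabilities follow directly from the Gaussian CDF once a threshold is chosen; a sketch with illustrative parameters (µ1 = 0 for healthy, µ2 = 2 for cancerous, unit variance, threshold x* = 1):

```python
import math

def phi(z):
    # standard normal CDF
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rates(x_star, mu1, mu2, sigma):
    tpr = 1.0 - phi((x_star - mu2) / sigma)  # P(x > x* | w2), hit
    fpr = 1.0 - phi((x_star - mu1) / sigma)  # P(x > x* | w1), false alarm
    fnr = phi((x_star - mu2) / sigma)        # P(x < x* | w2), miss
    tnr = phi((x_star - mu1) / sigma)        # P(x < x* | w1), correct rejection
    return tpr, fpr, fnr, tnr

tpr, fpr, fnr, tnr = rates(1.0, 0.0, 2.0, 1.0)
print(round(tpr, 4), round(fpr, 4))  # 0.8413 0.1587
```

Note that TPR + FNR = 1 and FPR + TNR = 1: each pair partitions the same conditional distribution at x*.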
Accuracy is defined as the proportion of patterns predicted correctly among all patterns: Accuracy = (TP + TN) / (TP + TN + FP + FN). Positive/negative refers to the prediction; true/false refers to its correctness. We assume here that we know only the state of nature and the decision of the system. Confusion matrix (rows: true class, columns: assigned class):
              Assigned ω1   Assigned ω2
True ω1 (N)       TN            FP
True ω2 (P)       FN            TP
Sensitivity (Recall) is defined as the proportion of positive patterns correctly predicted as positive: Sensitivity = TP / (TP + FN). Specificity is the proportion of negative patterns correctly predicted as negative: Specificity = TN / (TN + FP). False Positive Rate is the proportion of negative patterns falsely predicted as positive: FPR = FP / (FP + TN) = 1 - Specificity.
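The definitions above can be sketched as follows (the counts are illustrative):

```python
def metrics(tp, fp, tn, fn):
    # standard confusion-matrix summary statistics
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # recall / true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "fpr": fp / (fp + tn),           # = 1 - specificity
    }

m = metrics(tp=40, fp=5, tn=45, fn=10)
print(m)  # accuracy 0.85, sensitivity 0.8, specificity 0.9, fpr 0.1
```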
ROC space [figure: ROC space with Sensitivity on the vertical axis and 1 - Specificity on the horizontal axis; the top-left corner corresponds to perfect classification and the diagonal to random guessing] ROC space depicts relative trade-offs between true positives (benefits) and false positives (costs).
ROC space [figure: class-conditional densities p(x|ω1) and p(x|ω2) with threshold x*, and the ROC curve traced out in Sensitivity vs 1 - Specificity axes as x* is varied] The actual shape of the curve is determined by how much overlap the two distributions have.
ROC curves are often summarized into a single metric known as the area under the curve (AUC). AUC can be used for performance comparison across different classifiers. The ROC curve can also be used to determine the optimal cutoff point: o minimize the distance to the top-left corner (d) o identify the point with the furthest vertical distance from the diagonal line (Youden's J)
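A minimal sketch of how the ROC curve, its AUC, and the Youden J cutoff might be computed from classifier scores (pure Python; the data are illustrative, and in practice library routines would be used):

```python
def roc_points(scores, labels):
    # sweep thresholds from high to low; a pattern is called positive if score >= t
    pos = sum(labels)
    neg = len(labels) - pos
    pts = [(0.0, 0.0)]
    tp = fp = 0
    for s, y in sorted(zip(scores, labels), reverse=True):
        if y == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / neg, tp / pos))  # (FPR, TPR) = (1 - specificity, sensitivity)
    return pts

def auc(pts):
    # trapezoidal area under the (FPR, TPR) curve
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

def youden_j(pts):
    # furthest vertical distance above the diagonal
    return max(tpr - fpr for fpr, tpr in pts)

pts = roc_points([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
print(auc(pts), youden_j(pts))  # 0.75 0.5
```

A perfectly separable score ordering yields AUC = 1 and J = 1; a random classifier stays on the diagonal with AUC = 0.5 and J = 0.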
We should note that since the distributions can be arbitrary, the operating characteristic need not be symmetric
Receiver operating characteristic (MATLAB) Syntax: [tpr,fpr,thresholds] = roc(targets,outputs); plotroc(targets,outputs)
load iris_dataset
net = patternnet(20);
net = train(net,irisInputs,irisTargets);
irisOutputs = sim(net,irisInputs);
[tpr,fpr,thresholds] = roc(irisTargets,irisOutputs);
plotroc(irisTargets,irisOutputs)
Is it possible to perform ROC analysis for a multiclass classification problem?
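One common answer is one-vs-rest: each class in turn is treated as positive and all others as negative, giving one ROC curve (and one AUC) per class. A sketch in Python with illustrative scores, using the rank-sum identity AUC = P(score of a positive > score of a negative), with ties counted as 0.5:

```python
def auc_ovr(scores, labels, positive_class):
    # binary AUC treating `positive_class` as positive and every other label as negative
    pos = [s for s, y in zip(scores, labels) if y == positive_class]
    neg = [s for s, y in zip(scores, labels) if y != positive_class]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# illustrative: scores are the classifier's output for the "setosa" class
labels = ["setosa", "setosa", "versicolor", "virginica"]
setosa_scores = [0.9, 0.8, 0.3, 0.1]
print(auc_ovr(setosa_scores, labels, "setosa"))  # 1.0
```

Repeating the call for each class yields a per-class AUC profile, which is one standard way to summarize multiclass ROC analysis.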
Discrete Features In many practical applications the components of x are binary-, ternary-, or higher integer-valued, so that x can assume only one of a finite number of discrete values. The definition of the conditional risk R(α|x) is unchanged, and the fundamental Bayes decision rule remains the same. Consider the two-category problem in which the components of the feature vector are binary-valued and conditionally independent.
Discrete Features Let x = (x1, ..., xd)^t, where each binary feature xi gives a yes/no answer about the pattern. Define pi = P(xi = 1 | ω1) and qi = P(xi = 1 | ω2). By conditional independence, the likelihood ratio is
P(x|ω1) / P(x|ω2) = Π_i [pi^xi (1-pi)^(1-xi)] / [qi^xi (1-qi)^(1-xi)]
so the log-likelihood-ratio discriminant g(x) = ln P(x|ω1)/P(x|ω2) + ln P(ω1)/P(ω2) is linear in the features:
g(x) = Σ_i wi xi + w0,  where  wi = ln [pi(1-qi) / (qi(1-pi))]  and  w0 = Σ_i ln [(1-pi)/(1-qi)] + ln [P(ω1)/P(ω2)]
Decide ω1 if g(x) > 0, otherwise decide ω2.
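The linear discriminant above can be sketched directly (the pi, qi values below are illustrative):

```python
import math

def binary_bayes_weights(p, q, prior1, prior2):
    # w_i = ln[p_i(1-q_i) / (q_i(1-p_i))]
    w = [math.log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]
    # w_0 = sum_i ln[(1-p_i)/(1-q_i)] + ln[P(w1)/P(w2)]
    w0 = sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q)) + math.log(prior1 / prior2)
    return w, w0

def g(x, w, w0):
    # decide w1 if g(x) > 0, else w2
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

# illustrative: each feature is 'on' with probability 0.8 under w1 and 0.2 under w2
w, w0 = binary_bayes_weights([0.8, 0.8], [0.2, 0.2], 0.5, 0.5)
print(g([1, 1], w, w0) > 0)  # True: both features on -> decide w1
print(g([0, 0], w, w0) > 0)  # False: both features off -> decide w2
```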
Summary The basic idea of Bayes decision theory is to minimize the overall risk: choose the action that minimizes the conditional risk, weighting the posteriors by the different penalties for misclassifying patterns. If the underlying distributions are multivariate Gaussian, the decision boundaries will be hyperquadrics. Receiver operating characteristic curves describe the inherent, unchangeable properties of a classifier and can be used, for example, to determine the Bayes rate.