Bayes rule and Bayes error. Donglin Zeng, Department of Biostatistics, University of North Carolina

1 Bayes rule and Bayes error

2 Definition
If $f^*$ minimizes $E[L(Y, f(X))]$, then $f^*$ is called a Bayes rule (associated with the loss function $L(y, f)$), and the resulting prediction error rate, $E[L(Y, f^*(X))]$, is called the Bayes error.
Remark 1: this definition does not restrict the choice of $f$; $f^*$ may not always exist.
Remark 2: the Bayes rule may not be unique. For example, if $L(Y, f) = I(Y \neq f)$ and $Y \in \{-1, 1\}$, with the prediction read off as the sign of $f$, then any function $f$ with the same sign as $f^*$ is also a Bayes rule.
Remark 3: the Bayes error is the minimal error rate for future prediction, and it is different from (often larger than) the average training error based on empirical data, $n^{-1} \sum_{i=1}^n L(Y_i, f(X_i))$.

3 Squared loss function
The loss function is $L(y, f) = (y - f)^2$. The Bayes rule is $f^*(x) = E[Y \mid X = x]$. The Bayes error is $\mathrm{EPE}(f^*) = E[\mathrm{Var}(Y \mid X)]$.
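As a rough numerical check of the squared-loss result, the sketch below computes the risk exactly on a small discrete joint distribution. The probabilities and the helper names (`p_x`, `p_y_given_x`, `risk`, `cond_mean`, `cond_var`) are hypothetical and chosen only for illustration.

```python
# Toy joint distribution (hypothetical numbers): P(X = x) and P(Y = y | X = x).
p_x = {0: 0.5, 1: 0.3, 2: 0.2}
p_y_given_x = {
    0: {0.0: 0.5, 2.0: 0.5},
    1: {1.0: 0.25, 3.0: 0.75},
    2: {4.0: 1.0},
}

def risk(rule):
    """Exact E[(Y - rule(X))^2] under the toy distribution."""
    return sum(
        p_x[x] * sum(p * (y - rule(x)) ** 2 for y, p in p_y_given_x[x].items())
        for x in p_x
    )

def cond_mean(x):
    """E[Y | X = x]."""
    return sum(p * y for y, p in p_y_given_x[x].items())

def cond_var(x):
    """Var(Y | X = x)."""
    m = cond_mean(x)
    return sum(p * (y - m) ** 2 for y, p in p_y_given_x[x].items())

# Risk of the conditional mean, which should equal E[Var(Y | X)].
bayes_error = risk(cond_mean)
expected_cond_var = sum(p_x[x] * cond_var(x) for x in p_x)

# Any other rule, here the conditional mean shifted by 0.5, has larger risk.
shifted_risk = risk(lambda x: cond_mean(x) + 0.5)

print(bayes_error, expected_cond_var, shifted_risk)
```

In this toy case the shifted rule's risk exceeds the Bayes error by exactly the squared shift, 0.25, which is the usual bias-squared decomposition.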

4 Absolute deviation loss
The loss function is $L(y, f) = |y - f|$. The Bayes rule is $f^*(x) = \mathrm{median}(Y \mid X = x)$. The Bayes error is $\mathrm{EPE}(f^*) = E[|Y - \mathrm{med}(Y \mid X)|]$.
A more general loss is $L(y, f) = (y - f)\{\tau - I(y < f)\}$. The Bayes rule is the $\tau$-quantile of the distribution of $Y$ given $X = x$.
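The quantile result can be illustrated on a sample: minimizing the average of the general loss over a grid of candidate values should return the sample $\tau$-quantile. This is a minimal sketch; the data, the candidate grid, and the function names are hypothetical.

```python
def check_loss(y, f, tau):
    """General loss (y - f) * (tau - I(y < f)); tau = 0.5 gives half the absolute loss."""
    return (y - f) * (tau - (1.0 if y < f else 0.0))

def empirical_minimizer(sample, tau, candidates):
    """Candidate value with the smallest total check loss on the sample."""
    return min(candidates, key=lambda f: sum(check_loss(y, f, tau) for y in sample))

sample = list(range(1, 12))                    # hypothetical data: 1, 2, ..., 11
candidates = [i / 10 for i in range(0, 121)]   # grid 0.0, 0.1, ..., 12.0

median_fit = empirical_minimizer(sample, 0.5, candidates)
lower_quartile_fit = empirical_minimizer(sample, 0.25, candidates)

print(median_fit, lower_quartile_fit)
```

With 11 points, the minimizer for $\tau = 0.5$ is the sixth order statistic (the sample median), and for $\tau = 0.25$ it is the third order statistic.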

5 Misclassification loss
Consider $Y$ to be categorical with labels $1, \ldots, K$. The misclassification loss is $L(y, f) = I(y \neq f)$. The Bayes rule is $f^*(x) = \arg\max_k P(Y = k \mid X = x)$. The Bayes error is
$\mathrm{EPE}(f^*) = E\left[\sum_{k=1}^K P(Y = k \mid X)\, I\big(P(Y = k \mid X) < \max_j P(Y = j \mid X)\big)\right] = 1 - E[\max_j P(Y = j \mid X)]$.
In particular, when $K = 2$, this error is $E[1 - \max(\eta(X), 1 - \eta(X))] = E[\min(\eta(X), 1 - \eta(X))]$, where $\eta(x) = P(Y = 1 \mid X = x)$.
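For a concrete feel, the following sketch evaluates the Bayes rule and the Bayes error for $K = 3$ classes when $X$ takes two values. The posterior probabilities and all names are hypothetical, picked so that there are no ties.

```python
# Hypothetical distribution: P(X = x) and posterior P(Y = k | X = x) for k = 1, 2, 3.
p_x = {"a": 0.4, "b": 0.6}
posterior = {"a": [0.7, 0.2, 0.1], "b": [0.2, 0.3, 0.5]}

def bayes_rule(x):
    """Label (1-based) with the largest posterior probability at x."""
    probs = posterior[x]
    return 1 + max(range(len(probs)), key=lambda k: probs[k])

def misclassification_risk(rule):
    """Exact P(Y != rule(X)) = sum_x P(X = x) * (1 - P(Y = rule(x) | X = x))."""
    return sum(p_x[x] * (1.0 - posterior[x][rule(x) - 1]) for x in p_x)

# Bayes error via the closed form 1 - E[max_j P(Y = j | X)].
bayes_error = 1.0 - sum(p_x[x] * max(posterior[x]) for x in p_x)

risk_of_bayes_rule = misclassification_risk(bayes_rule)
risk_of_constant_rule = misclassification_risk(lambda x: 1)  # always predict label 1

print(bayes_rule("a"), bayes_rule("b"), bayes_error, risk_of_constant_rule)
```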

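6 More on misclassification loss (K = 2)
$\mathrm{EPE}(f^*) = \frac{1}{2} - \frac{1}{2} E[|2\eta(X) - 1|]$.
For $k \in \{0, 1\}$, let $f_k(x)$ be the density of $X$ given $Y = k$. Then, when $P(Y = 1) = P(Y = 0) = 1/2$, $\mathrm{EPE}(f^*) = \frac{1}{2} - \frac{1}{4} \int |f_1(x) - f_0(x)|\, dx$.
For any $f(x) = I(\tilde\eta(x) > 1/2)$, where $\tilde\eta(x) \in [0, 1]$,
$E[I(Y \neq f(X))] - \mathrm{EPE}(f^*) = 2 \int |\eta(x) - 1/2|\, I(f(x) \neq f^*(x))\, d\mu(x)$
and
$E[I(Y \neq f(X))] - \mathrm{EPE}(f^*) \leq 2 E[|\eta(X) - \tilde\eta(X)|]$,
where $\mu$ is the distribution of $X$.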
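The excess-risk identity and bound for a plug-in rule can be checked by hand on a discrete example. The sketch below assumes labels in {0, 1}, a three-point distribution for $X$, and a made-up estimate `eta_tilde` of the true `eta`; none of these numbers come from the slides.

```python
p_x = [0.2, 0.3, 0.5]          # hypothetical P(X = x) on three points
eta = [0.9, 0.6, 0.3]          # true P(Y = 1 | X = x)
eta_tilde = [0.8, 0.4, 0.35]   # hypothetical estimate of eta

def risk(rule):
    """Exact P(Y != rule(X)) for a rule given as a list of 0/1 predictions."""
    return sum(p * ((1.0 - e) if r == 1 else e) for p, e, r in zip(p_x, eta, rule))

bayes = [1 if e > 0.5 else 0 for e in eta]
plug_in = [1 if e > 0.5 else 0 for e in eta_tilde]

bayes_error = risk(bayes)
# Alternative form 1/2 - (1/2) E|2 eta(X) - 1|.
bayes_error_alt = 0.5 - 0.5 * sum(p * abs(2.0 * e - 1.0) for p, e in zip(p_x, eta))

excess = risk(plug_in) - bayes_error
# Identity: 2 * integral of |eta - 1/2| over the region where the rules disagree.
identity_rhs = 2.0 * sum(
    p * abs(e - 0.5) for p, e, r, b in zip(p_x, eta, plug_in, bayes) if r != b
)
# Upper bound: 2 E|eta(X) - eta_tilde(X)|.
bound = 2.0 * sum(p * abs(e - et) for p, e, et in zip(p_x, eta, eta_tilde))

print(bayes_error, excess, identity_rhs, bound)
```

Here the plug-in rule disagrees with the Bayes rule only at the middle point, where `eta` is close to 1/2, so the excess risk is small even though `eta_tilde` is off by 0.2 there.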

7 Interpretation of Bayes error
The Bayes error measures how difficult the classification problem is. Clearly, the closer $\eta(x)$ is to 0 or 1, the more separable the data are, and so the smaller the Bayes error is. The Bayes error also describes the discriminatory power in the distribution of $(Y, X)$.

8 Alternative discriminatory distances
There are other discriminatory distances, including
the Kolmogorov variational distance: $\frac{1}{2} E[|P(Y = 0 \mid X) - P(Y = 1 \mid X)|] = \frac{1}{2} E[|2\eta(X) - 1|]$;
the nearest neighbor error: $E[2\eta(X)(1 - \eta(X))]$;
the Bhattacharyya affinity: $-\log E\left\{\sqrt{\eta(X)(1 - \eta(X))}\right\}$;
the entropy: $-E[\eta(X) \log \eta(X) + (1 - \eta(X)) \log(1 - \eta(X))]$;
Jeffreys divergence: $E[(2\eta(X) - 1) \log\{\eta(X)/(1 - \eta(X))\}]$.
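To compare these measures side by side, the sketch below evaluates each of them on a hypothetical three-point distribution with $\eta(x)$ strictly inside (0, 1). The sign conventions (negative log for the Bhattacharyya affinity, leading minus for the entropy) are the standard ones and are assumed here; the function names are illustrative.

```python
import math

p_x = [0.2, 0.3, 0.5]   # hypothetical P(X = x)
eta = [0.9, 0.6, 0.3]   # hypothetical P(Y = 1 | X = x), strictly inside (0, 1)

def expect(g):
    """E[g(eta(X))] under the toy distribution."""
    return sum(p * g(e) for p, e in zip(p_x, eta))

bayes_error = expect(lambda e: min(e, 1.0 - e))
kolmogorov = 0.5 * expect(lambda e: abs(2.0 * e - 1.0))
nearest_neighbor = expect(lambda e: 2.0 * e * (1.0 - e))
bhattacharyya = -math.log(expect(lambda e: math.sqrt(e * (1.0 - e))))
entropy = -expect(lambda e: e * math.log(e) + (1.0 - e) * math.log(1.0 - e))
jeffreys = expect(lambda e: (2.0 * e - 1.0) * math.log(e / (1.0 - e)))

for name, value in [
    ("Bayes error", bayes_error),
    ("Kolmogorov", kolmogorov),
    ("nearest neighbor", nearest_neighbor),
    ("Bhattacharyya", bhattacharyya),
    ("entropy", entropy),
    ("Jeffreys", jeffreys),
]:
    print(f"{name}: {value:.4f}")
```

Since $\min(\eta, 1 - \eta) \leq 2\eta(1 - \eta)$ pointwise, the nearest neighbor error is never below the Bayes error, and the Kolmogorov distance equals 1/2 minus the Bayes error.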

9 Surrogate loss function
A surrogate loss function is a substitute for $L(y, f)$ used for supervised learning. It often has desirable properties that facilitate computation, such as continuity and convexity. A surrogate loss function $L_S(y, f)$ is called Fisher consistent if the minimizer of $E[L_S(Y, f(X))]$, after a certain transformation, leads to the Bayes rule.

10 Fisher consistent loss for classification
Now we use labels $Y \in \{-1, 1\}$. Since the classification rule can always be determined by the sign of some decision function, we use $f(X)$, without confusion, for this decision function. That is, the rule is $\mathrm{sign}(f)$. Consider a surrogate loss $L_S(y, f) = (y - f)^2$. The minimizer is $f^*(x) = E[Y \mid X = x]$, so if we use $\mathrm{sign}(f^*(X))$ as the prediction rule, the rule is the Bayes rule.
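A short derivation may help show why the sign of the squared-loss minimizer recovers the Bayes rule. It assumes the notation $\eta(x) = P(Y = 1 \mid X = x)$ for labels in $\{-1, 1\}$, which is introduced here for illustration.

```latex
\begin{align*}
E[(Y - f)^2 \mid X = x]
  &= \eta(x)\,(1 - f)^2 + \{1 - \eta(x)\}\,(1 + f)^2, \\
\frac{\partial}{\partial f} E[(Y - f)^2 \mid X = x]
  &= -2\eta(x)(1 - f) + 2\{1 - \eta(x)\}(1 + f) = 2 + 2f - 4\eta(x) = 0
  \;\Longrightarrow\; f^*(x) = 2\eta(x) - 1, \\
\operatorname{sign}(f^*(x)) = 1
  &\iff \eta(x) > 1/2
  \iff P(Y = 1 \mid X = x) > P(Y = -1 \mid X = x).
\end{align*}
```

So $f^*(x) = E[Y \mid X = x] = 2\eta(x) - 1$ is positive exactly when class 1 is the more probable label, which is the misclassification Bayes rule.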

11 Large-margin loss
Recall that $Y \in \{-1, 1\}$. More generally, consider a surrogate loss $L_S(y, f) = \phi(yf)$, where $\phi(z)$ is convex, differentiable at zero, and $\phi'(0) < 0$. Then the rule defined as $\mathrm{sign}(f^*(X))$, where $f^*$ minimizes $E[L_S(Y, f(X))]$, is the Bayes rule. Such an $L_S(y, f)$ is called a large-margin loss; examples include the square loss and the hinge loss, defined as $(1 - yf)_+$.
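The consistency claim can be probed numerically: for a fixed value of $\eta = P(Y = 1 \mid X = x)$, minimize the conditional surrogate risk $\eta\,\phi(f) + (1 - \eta)\,\phi(-f)$ over a grid of $f$ and compare the sign of the minimizer with the sign of $2\eta - 1$. This is only a sketch; the grid, the chosen values of $\eta$, and the function names are assumptions, and the square loss is written as $\phi(z) = (1 - z)^2$, which equals $(y - f)^2$ when $y \in \{-1, 1\}$.

```python
def square_phi(z):
    """Square loss as a margin loss: (1 - z)^2 equals (y - f)^2 when y is +1 or -1."""
    return (1.0 - z) ** 2

def hinge_phi(z):
    """Hinge loss (1 - z)_+."""
    return max(0.0, 1.0 - z)

def conditional_minimizer(phi, eta, grid):
    """Grid value f minimizing eta * phi(f) + (1 - eta) * phi(-f)."""
    return min(grid, key=lambda f: eta * phi(f) + (1.0 - eta) * phi(-f))

def sign(v):
    return 1 if v > 0 else (-1 if v < 0 else 0)

grid = [i / 100 for i in range(-300, 301)]   # f in [-3, 3], step 0.01
etas = [0.1, 0.3, 0.7, 0.9]                  # hypothetical posterior values

results = {
    eta: (
        conditional_minimizer(square_phi, eta, grid),
        conditional_minimizer(hinge_phi, eta, grid),
    )
    for eta in etas
}

for eta, (f_square, f_hinge) in results.items():
    print(eta, f_square, f_hinge, sign(2.0 * eta - 1.0))
```

In this sketch the square-loss minimizer tracks $2\eta - 1$, while the hinge-loss minimizer jumps to $\pm 1$; both share the sign of $2\eta - 1$.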

12 General remarks
The Bayes error serves as the gold standard for assessing the performance of any prediction rule. The Bayes error quantifies the difficulty of prediction based on the data's distribution. For many loss functions, the corresponding Bayes error has no closed-form expression. Using an appropriate surrogate loss can lead to the same Bayes rule.

13 Learning Bayes rule
For some loss functions, such as squared loss and misclassification loss, we have explicit solutions in terms of the joint distribution of $(Y, X)$. Hence, direct learning methods estimate these solutions using empirical data, and many statistical models can be used for this purpose. We group these methods based on model assumptions: parametric, semiparametric and nonparametric.

14 Learning Bayes rule (more)
For loss functions that do not yield explicit solutions for the Bayes rule, we can learn the rule by minimizing an approximate objective function using empirical data. We call these methods indirect learning. There are two levels of approximation:
approximate $E[L(Y, f(X))]$ by $n^{-1} \sum_{i=1}^n L(Y_i, f(X_i))$ (empirical risk approximation);
substitute $L(Y, f(X))$ by $L_S(Y, f(X))$ for some surrogate loss $L_S$, for computational convenience.
The first approximation introduces stochastic error due to data randomness; the second introduces extra bias because the surrogate differs from the original loss.
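The two approximations can be sketched together: replace the expectation by a sample average, and replace the 0-1 loss by the hinge loss when searching for a rule. Everything specific below is an assumption made for illustration: the data-generating model, the one-parameter family $f_c(x) = x - c$, the threshold grid, and the function names.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def draw(n):
    """Hypothetical model: X ~ Uniform(-1, 1), P(Y = 1 | X = x) = (1 + x) / 2, Y in {-1, 1}."""
    data = []
    for _ in range(n):
        x = random.uniform(-1.0, 1.0)
        y = 1 if random.random() < (1.0 + x) / 2.0 else -1
        data.append((x, y))
    return data

def zero_one(y, f):
    """Misclassification loss for the rule sign(f); f = 0 is counted as an error."""
    return 1.0 if y * f <= 0 else 0.0

def hinge(y, f):
    """Surrogate hinge loss (1 - y f)_+."""
    return max(0.0, 1.0 - y * f)

def empirical_risk(loss, c, data):
    """n^{-1} sum_i loss(Y_i, f_c(X_i)) for the decision function f_c(x) = x - c."""
    return sum(loss(y, x - c) for x, y in data) / len(data)

data = draw(500)
thresholds = [i / 20 - 1.0 for i in range(41)]   # candidate c in [-1, 1]

# Indirect learning: minimize the empirical surrogate (hinge) risk over the family.
c_hat = min(thresholds, key=lambda c: empirical_risk(hinge, c, data))
train_error = empirical_risk(zero_one, c_hat, data)

print(c_hat, train_error)
```

Under this assumed model the Bayes rule is $\mathrm{sign}(x)$, that is $c = 0$, with Bayes error $E[(1 - |X|)/2] = 0.25$; the learned threshold and its training error vary around these values with the sample.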

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It does not estimate the prediction rule f(x) directly, since most loss functions do not have explicit optimizers. Indirection