On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality. Weiqiang Dong

The goal of the work presented here is to illustrate that classification error responds to error in the target probability estimates in a much different (and perhaps less intuitive) way than squared estimation error.

Overview
1. Function Estimation and Estimation Error
2. Classification and Classification Error
3. Discussion

Function Estimation
Input: x. Output: $y = f(x) + \varepsilon$, where $f(x)$ (the "target function") is a single-valued deterministic function of x and $\varepsilon$ is a random variable with $E(\varepsilon \mid x) = 0$. The goal is to obtain an estimate $\hat f(x; T)$ using a training data set T.

Estimation Error
The goal is to obtain an estimate $\hat f(x; T)$ using a training data set T.
Mean squared error:
$E_{T,\varepsilon}\big[(y - \hat f(x;T))^2\big] = \big[f(x) - E_T \hat f(x;T)\big]^2 + E_T\big[(\hat f(x;T) - E_T \hat f(x;T))^2\big] + E\big[\varepsilon^2 \mid x\big]$
1. Squared bias
2. Variance
3. Irreducible prediction error

$E_{T,\varepsilon}\big[(y - \hat f(x;T))^2\big] = \big[f(x) - E_T \hat f(x;T)\big]^2 + E_T\big[(\hat f(x;T) - E_T \hat f(x;T))^2\big] + E\big[\varepsilon^2 \mid x\big]$
1. Squared bias: the extent to which the average prediction over all data sets differs from the desired regression function.
2. Variance: the extent to which the solutions for individual data sets vary around their average (sensitivity to the particular choice of data set).
3. Irreducible prediction error.
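
For completeness, a short derivation of this decomposition (writing $\hat f$ for $\hat f(x;T)$); it uses only $E[\varepsilon \mid x] = 0$, the independence of $\varepsilon$ from T, and adding and subtracting $E_T \hat f$:

```latex
\begin{aligned}
E_{T,\varepsilon}\big[(y-\hat f)^2\big]
  &= E_{T,\varepsilon}\big[\big((f(x)-\hat f)+\varepsilon\big)^2\big]
   = E_T\big[(f(x)-\hat f)^2\big] + E\big[\varepsilon^2 \mid x\big]
   \quad\text{(the cross term vanishes since } E[\varepsilon\mid x]=0\text{),}\\[4pt]
E_T\big[(f(x)-\hat f)^2\big]
  &= E_T\big[\big((f(x)-E_T\hat f)+(E_T\hat f-\hat f)\big)^2\big]
   = \big(f(x)-E_T\hat f\big)^2 + E_T\big[(\hat f-E_T\hat f)^2\big]
   \quad\text{(since } E_T\big[E_T\hat f-\hat f\big]=0\text{).}
\end{aligned}
```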

Bias-Variance Trade-off
$y = f(x) + \varepsilon$, with $f(x) = \sin(2\pi x)$ and x uniformly distributed. The data sets T were obtained by first computing the corresponding values of the function $\sin(2\pi x)$ and then adding a small level of random noise. (Christopher Bishop, Pattern Recognition and Machine Learning, 2006)

Bias-Variance Trade-off
$y = f(x) + \varepsilon$, with $f(x) = \sin(2\pi x)$. We generate 25 data sets from f(x), each containing 25 data points. For each data set we fit a polynomial function to the data. (Christopher Bishop, Pattern Recognition and Machine Learning, 2006)

The left column shows the result of fitting the model to each of the 25 data sets; the right column shows the corresponding average of the 25 fits. (Christopher Bishop, Pattern Recognition and Machine Learning, 2006)
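
A minimal Python sketch of this experiment; the polynomial degree and the noise level are assumptions, since the slides only say "a polynomial function" and "a small level of random noise":

```python
import numpy as np

rng = np.random.default_rng(0)
n_datasets, n_points, degree, noise_sd = 25, 25, 3, 0.3   # degree and noise level are assumed

def f(x):
    return np.sin(2 * np.pi * x)                           # target function

x_grid = np.linspace(0, 1, 200)
fits = []
for _ in range(n_datasets):
    x = rng.uniform(0, 1, n_points)                        # x uniformly distributed
    y = f(x) + rng.normal(0, noise_sd, n_points)           # y = f(x) + eps
    coef = np.polyfit(x, y, degree)                        # fit a polynomial to this data set
    fits.append(np.polyval(coef, x_grid))

fits = np.array(fits)                                      # one curve per data set ("left column")
avg_fit = fits.mean(axis=0)                                # average of the 25 fits ("right column")
bias_sq = np.mean((avg_fit - f(x_grid)) ** 2)              # (integrated) squared bias
variance = np.mean(fits.var(axis=0))                       # (integrated) variance
print(f"squared bias ~ {bias_sq:.4f}, variance ~ {variance:.4f}")
```

Increasing the polynomial degree (or, in Bishop's version of the experiment, weakening the regularization) typically lowers the squared bias while raising the variance, which is the trade-off discussed on the next slide.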

Bias-Variance Trade-off
$E_{T,\varepsilon}\big[(y - \hat f(x;T))^2\big] = \big[f(x) - E_T \hat f(x;T)\big]^2 + E_T\big[(\hat f(x;T) - E_T \hat f(x;T))^2\big] + E\big[\varepsilon^2 \mid x\big]$
1. Squared bias 2. Variance 3. Irreducible prediction error
It is desirable to have both low squared bias and low variance, since both contribute to the squared estimation error in equal measure. However, there is a natural bias-variance trade-off associated with function approximation.

Classification
Input: $x = (x_1, \ldots, x_n)$. Output: $y \in \{0, 1\}$. Prediction: $\hat y \in \{0, 1\}$.
The goal is to choose $\hat y(x)$ to minimize inaccuracy as characterized by the misclassification risk.

The goal is to choose $\hat y(x)$ to minimize inaccuracy as characterized by the misclassification risk
$r(x) = \ell_1\, f(x)\, 1[\hat y(x) = 0] + \ell_0\, (1 - f(x))\, 1[\hat y(x) = 1]$   (2.2)
Here $\ell_0$ and $\ell_1$ are the losses incurred for the respective misclassifications, $1[\cdot]$ is an indicator function, and $f(x)$ is given by $f(x) = \Pr(y = 1 \mid x) = E[y \mid x]$.

The misclassification risk (2.2) is minimized by the ("Bayes") rule
$y_B(x) = 1\big[f(x) \ge \ell_0/(\ell_0 + \ell_1)\big]$,
which achieves the lowest possible risk
$r_B(x) = \min\big[\ell_1 f(x),\ \ell_0 (1 - f(x))\big]$.

The training data set T is used to learn a classification rule $\hat y(x; T)$ for (future) prediction. The usual paradigm for accomplishing this is to use the training data T to form an approximation (estimate) $\hat f(x; T)$ of $f(x)$. Regular function estimation technology, e.g., neural networks, decision tree induction methods, or nearest neighbor methods, can be applied to obtain the estimate $\hat f(x; T)$, which is then plugged into (2.6) to form a classification rule.
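
As a hedged illustration of this plug-in paradigm (not the paper's specific construction): any estimate $\hat f$ of $f(x) = \Pr(y = 1 \mid x)$ is turned into a classification rule by thresholding it at $\ell_0/(\ell_0 + \ell_1)$. The K-nearest-neighbor probability estimate on a one-dimensional input below is just one illustrative choice of estimator.

```python
import numpy as np

def knn_prob_estimate(x_train, y_train, x, k=25):
    """Estimate f(x) = Pr(y = 1 | x) as the fraction of 1s among the k nearest neighbors."""
    idx = np.argsort(np.abs(x_train - x))[:k]
    return y_train[idx].mean()

def plug_in_rule(f_hat, l0=1.0, l1=1.0):
    """Turn a probability estimate f_hat into the rule y_hat(x) = 1[f_hat(x) >= l0/(l0+l1)]."""
    threshold = l0 / (l0 + l1)
    return lambda x: int(f_hat(x) >= threshold)

# Toy usage on synthetic data (the two-region example of the later slides):
rng = np.random.default_rng(1)
x_tr = rng.uniform(0, 1, 200)
y_tr = (rng.uniform(size=200) < np.where(x_tr <= 0.5, 0.9, 0.1)).astype(float)
rule = plug_in_rule(lambda x: knn_prob_estimate(x_tr, y_tr, x))
print(rule(0.48))   # agrees with the Bayes rule (predict 1) with high probability
```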

Classification Error
Let $\ell_0 = \ell_1 = 1$. The misclassification risk is then minimized by the ("Bayes") rule
$y_B(x) = 1[f(x) \ge 1/2]$,
which achieves the lowest possible risk. The prediction is
$\hat y(x; T) = 1[\hat f(x; T) \ge 1/2]$.
If the prediction agrees with that of the Bayes rule:
$\Pr(\hat y(x; T) \ne y) = \Pr(y_B(x) \ne y) = \min[f(x),\ 1 - f(x)]$
If not:
$\Pr(\hat y(x; T) \ne y) = \max[f(x),\ 1 - f(x)] = |2f(x) - 1| + \Pr(y_B(x) \ne y)$

Classification Error
Therefore one has
$\Pr(\hat y(x; T) \ne y) = |2f(x) - 1|\, 1[\hat y(x; T) \ne y_B(x)] + \Pr(y_B(x) \ne y)$.
Averaging over all training samples T, under the assumption that they are drawn independently of the future data to be predicted (so the expectation of the indicator becomes the probability of disagreement), one has
$\Pr(\hat y(x) \ne y) = |2f(x) - 1|\, \Pr(\hat y(x) \ne y_B(x)) + \Pr(y_B(x) \ne y)$.

Classification Error
$y_B(x) = 1[f(x) \ge 1/2]$, $\hat y(x) = 1[\hat f(x) \ge 1/2]$.
In this decomposition, $\Pr(\hat y \ne y_B)$ is the only quantity that involves the probability estimate $\hat f$.

Classification Error
$\Pr(Y = 1 \mid x) = 0.9$, $\Pr(Y = 0 \mid x) = 0.1$ for $0 \le x \le 0.5$;
$\Pr(Y = 1 \mid x) = 0.1$, $\Pr(Y = 0 \mid x) = 0.9$ for $0.5 < x \le 1$.
Sample 100 observations from each class, fit a linear regression model, and evaluate the estimate at $x = 0.48$, with $\hat y(x) = 1[\hat f(x) \ge 1/2]$.
Here $\Pr(y \ne y_B) = 0.1$ and $f = \Pr(y = 1 \mid x = 0.48) = 0.9$.

Classification Error
$\Pr(Y = 1 \mid x) = 0.9$, $\Pr(Y = 0 \mid x) = 0.1$ for $0 \le x \le 0.5$;
$\Pr(Y = 1 \mid x) = 0.1$, $\Pr(Y = 0 \mid x) = 0.9$ for $0.5 < x \le 1$.
Sample 100 observations from each class, fit a linear regression model, and evaluate the estimate at $x = 0.48$.
$\Pr(y \ne y_B) = 0.1$ and $f = \Pr(y = 1 \mid x = 0.48) = 0.9$.
Over repeated training samples, $\operatorname{mean}(\hat f) = 0.5337$ and $\Pr(\hat f < 0.5) = 0.0602$, so
$\Pr(\hat y \ne y) = |2f - 1| \cdot 0.0602 + \Pr(y_B \ne y) = |2(0.9) - 1| \cdot 0.0602 + 0.1 = 0.1482$.
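
A minimal Monte Carlo sketch of this calculation. The slides' sampling scheme ("100 observations from each class") is ambiguous, so this sketch draws 100 x-values uniformly within each half of [0, 1] and labels them with the stated probabilities; the reproduced numbers are therefore only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_reps, x0 = 10_000, 0.48

f_hat = np.empty(n_reps)
for r in range(n_reps):
    x = np.concatenate([rng.uniform(0.0, 0.5, 100), rng.uniform(0.5, 1.0, 100)])
    p1 = np.where(x <= 0.5, 0.9, 0.1)                 # Pr(Y = 1 | x)
    y = (rng.uniform(size=200) < p1).astype(float)
    slope, intercept = np.polyfit(x, y, 1)            # linear regression of y on x
    f_hat[r] = slope * x0 + intercept                 # estimate evaluated at x = 0.48

p_disagree = np.mean(f_hat < 0.5)                     # Pr(y_hat != y_B) at x = 0.48
f, bayes_err = 0.9, 0.1
error = abs(2 * f - 1) * p_disagree + bayes_err       # classification error at x = 0.48
print(f"mean f_hat = {f_hat.mean():.4f}, Pr(f_hat < 0.5) = {p_disagree:.4f}, error = {error:.4f}")
```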

Classification Error
In order to gain some intuition, we approximate the sampling distribution $p(\hat f)$ of the estimate by a normal distribution.

Classification Error
$\hat y(x) = 1[\hat f(x) \ge 1/2]$.
Under the normal approximation,
$\Pr(\hat y \ne y_B) = \Phi\!\left(\dfrac{b(f, E\hat f)}{\sqrt{\operatorname{var}\hat f}}\right)$,
where the boundary bias is $b(f, E\hat f) = \operatorname{sign}(1/2 - f)\,(E\hat f - 1/2)$. No estimation-bias term $E\hat f - f$ appears in this expression.

Classification Error
$\Pr(\hat y \ne y_B) = \Phi\!\left(\dfrac{b(f, E\hat f)}{\sqrt{\operatorname{var}\hat f}}\right)$; there is no $E\hat f - f$ term.
For a given $\operatorname{var}\hat f$, so long as the boundary bias remains negative, the classification error decreases as $E\hat f$ moves further from 1/2, irrespective of the estimation bias $E\hat f - f$. For a positive boundary bias, the classification error increases with the distance of $E\hat f$ from 1/2.
For a given $E\hat f$, so long as the boundary bias remains negative, the classification error decreases as the variance decreases. For a positive boundary bias, the error increases as the variance decreases.
What is needed is therefore (1) a negative boundary bias and (2) a small enough variance.

Classification Error
The key thing to note is that $E\hat f$ may be off from $f$ by a huge margin. This does not matter as long as the estimate lies on the appropriate side of 1/2 (negative boundary bias $b$) and its variance is kept small.
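
A hedged numerical illustration of this point using the normal approximation above; the values of $f$, $E\hat f$, and $\operatorname{var}\hat f$ below are hypothetical. The estimate is badly biased ($E\hat f$ far from $f = 0.9$), yet as long as it stays on the correct side of 1/2 (negative boundary bias), pushing $E\hat f$ further from 1/2 or shrinking the variance only reduces the classification error.

```python
import numpy as np
from scipy.stats import norm

def disagreement_prob(f, mean_fhat, var_fhat):
    """Normal approximation: Pr(y_hat != y_B) = Phi(b / sqrt(var f_hat)),
    with boundary bias b = sign(1/2 - f) * (E f_hat - 1/2)."""
    b = np.sign(0.5 - f) * (mean_fhat - 0.5)
    return norm.cdf(b / np.sqrt(var_fhat))

f = 0.9                                  # true Pr(y = 1 | x); the Bayes rule predicts 1
for mean_fhat in (0.55, 0.70):           # heavily biased estimates, but on the right side of 1/2
    for var_fhat in (0.04, 0.01):
        p = disagreement_prob(f, mean_fhat, var_fhat)
        error = abs(2 * f - 1) * p + min(f, 1 - f)
        print(f"E[f_hat]={mean_fhat:.2f}  var={var_fhat:.2f}  "
              f"Pr(y_hat != y_B)={p:.3f}  classification error={error:.3f}")
```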

Estimation Error:
$E_{T,\varepsilon}\big[(y - \hat f(x;T))^2\big] = \big[f(x) - E_T \hat f(x;T)\big]^2 + E_T\big[(\hat f(x;T) - E_T \hat f(x;T))^2\big] + E\big[\varepsilon^2 \mid x\big]$
(1. squared bias, 2. variance, 3. irreducible prediction error)
Classification Error:
$\Pr(\hat y(x) \ne y) = |2f(x) - 1|\,\Phi\!\left(\dfrac{b(f, E\hat f)}{\sqrt{\operatorname{var}\hat f}}\right) + \Pr(y_B(x) \ne y)$
The bias-variance trade-off is clearly very different for classification error than for estimation error on the probability function f itself. The dependence of squared estimation error on $E\hat f$ and $\operatorname{var}\hat f$ is additive, whereas for classification error there is a strong multiplicative interaction effect. Certain methods that are inappropriate for function estimation because of their very high bias may perform well for classification when their estimates are used in the context of a classification rule. All that is required is a negative boundary bias and a small enough variance. Procedures whose bias is caused by over-smoothing tend to have a negative boundary bias, e.g., the naive Bayes method and K-nearest neighbors.

Table 1 shows the average squared estimation error (column 2) and classification error (column 4) as a function of the training sample size N (column 1), along with the corresponding optimal numbers of nearest neighbors, $K_e$ and $K_c$ respectively (columns 3 and 5), at n = 20 dimensions. One sees that classification error decreases at a much faster rate than squared estimation error as N increases. The optimal value of K for squared estimation error (column 3) is seen to increase only very slowly with N.

One sees that classification error is not completely immune to the tendency of K-nearest neighbor methods to degrade as irrelevant inputs are included. But whereas the squared estimation error degrades by over a factor of 35 as the number of irrelevant inputs is increased by a factor of 20, the corresponding increase in classification error is less than a factor of six. Squared estimation error thus increases at a much faster rate than classification error as n increases.

Squared estimation error (upper panel) and classification error (lower panel) as a function of the number of nearest neighbors K, for n = 20 dimensions and training sample size N = 3200. One sees that the choice of the number of nearest neighbors is less critical for classification error, so long as K is neither too small nor too large (here 500 ≤ K ≤ 2000). Quite often, when K-nearest neighbors is compared to other classification methods, a small value of K is used. The simple example examined here suggests that, at least in some situations, this may underestimate the performance achievable with the K-nearest neighbor approach.
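
A rough sketch of this kind of experiment. The slides do not restate the paper's simulation design, so the target below (only the first of the 20 inputs is relevant; the rest are irrelevant noise dimensions) is a hypothetical stand-in. It is meant to show the qualitative pattern described above: large K yields a heavily over-smoothed, badly biased but low-variance $\hat f$, so squared estimation error remains high while classification error stays close to the Bayes rate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims, n_train, n_test = 20, 3200, 1000

def f_true(X):
    """Hypothetical target: Pr(y = 1 | x) depends only on the first coordinate."""
    return np.where(X[:, 0] <= 0.5, 0.9, 0.1)

def sample(n):
    X = rng.uniform(0, 1, size=(n, n_dims))
    y = (rng.uniform(size=n) < f_true(X)).astype(float)
    return X, y

X_tr, y_tr = sample(n_train)
X_te, y_te = sample(n_test)
f_te = f_true(X_te)

# Squared Euclidean distances from every test point to every training point;
# neighbors are sorted once and reused for every K.
d2 = (X_te**2).sum(1)[:, None] - 2 * X_te @ X_tr.T + (X_tr**2).sum(1)[None, :]
order = np.argsort(d2, axis=1)

print(f"Bayes error on this test set: {np.mean((f_te >= 0.5) != y_te):.4f}")
for K in (5, 50, 500, 2000):
    f_hat = y_tr[order[:, :K]].mean(axis=1)           # K-NN estimate of f(x)
    y_hat = (f_hat >= 0.5).astype(float)
    sq_err = np.mean((f_hat - f_te) ** 2)             # squared estimation error
    cls_err = np.mean(y_hat != y_te)                  # 0/1 classification error
    print(f"K={K:5d}  squared error={sq_err:.4f}  classification error={cls_err:.4f}")
```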

Much research in classification has been devoted to achieving more accurate probability estimates, under the presumption that this will generally lead to more accurate predictions. This need not always be the case.