Classifier Performance. Assessment and Improvement

Jennifer Fitzgerald
5 years ago

1 Classifier Performance Assessment and Improvement

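2 Error Rates
Define the Error Rate function
$$Q(\hat{\omega}, \omega) = 1 - \delta(\hat{\omega} - \omega) = \begin{cases} 0 & \text{if } \hat{\omega} - \omega = 0 \\ 1 & \text{otherwise} \end{cases}$$
When training a classifier, the Apparent error rate (or Test Error) is:
$$E_{test} = \frac{1}{n} \sum_{j=1}^{n} Q\big(\hat{\omega}_j(\{x_i, \omega_i\}, x_j),\, \omega_j\big)$$
where $x_i$ are training features and $\omega_i$ training labels.
We want a general way to assess the Generalization error (or True error rate):
$$E_{true} = \sum_{\omega} \int p(x, \omega)\, Q(\hat{\omega}(x), \omega)\, dx$$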
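As a minimal sketch of the averaged error function on this slide, the following Python computes an error rate from predicted and true labels. It assumes Q is 1 for a misclassification and 0 otherwise; the label lists are hypothetical and only illustrate the arithmetic.

```python
def zero_one_error(predicted, actual):
    """Q(w_hat, w): 1 when the predicted label differs from the true label, else 0."""
    return 0 if predicted == actual else 1


def error_rate(predicted_labels, true_labels):
    """Average of Q over n labeled samples."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("label sequences must have equal length")
    n = len(true_labels)
    return sum(zero_one_error(p, t) for p, t in zip(predicted_labels, true_labels)) / n


# Hypothetical labels for five samples; two of the five predictions are wrong.
predicted = ["a", "b", "b", "a", "b"]
actual = ["a", "b", "a", "a", "a"]
rate = error_rate(predicted, actual)  # 2 errors out of 5 samples
```

Whether this number is an apparent (training) error or a held-out estimate depends only on which samples the labels come from, not on the formula.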

3 Learning as Empirical Risk Minimization
Optimal decisions can be formulated as minimizing the theoretical Risk
$$R(\hat{\omega}) = \sum_{j=1}^{c} \int L(\hat{\omega}(x) \mid \omega_j)\, p(\omega_j, x)\, dx$$
But now our decision function is determined by parameters $\alpha$:
$$\hat{\omega}(x) = g(x; \alpha)$$
[e.g. $\alpha$ is the weight vector for the Perceptron]
This induces a loss function on $\alpha$:
$$R(\alpha) = \sum_{j=1}^{c} \int L(\hat{\omega}(x; \alpha) \mid \omega_j)\, P(\omega_j \mid x)\, p(x)\, dx$$
Two problems:
1. The TRUE $P(\omega_j, x)$ is unknown.
2. The estimate $\alpha(D)$ is typically a complicated random variable across test data sets, thus the Risk is too.

4 Learning as Empirical Risk Minimization
Goal: minimize
$$R(\alpha) = \sum_{j=1}^{c} \int L(\hat{\omega}(x; \alpha) \mid \omega_j)\, p(\omega_j, x)\, dx$$
Problem: what we are faced with is an estimate from data:
$$R_{est}(\hat{\alpha}_D) = \sum_{j=1}^{c} \int L(\hat{\omega}(x; \hat{\alpha}_D) \mid \omega_j)\, \hat{p}_D(\omega_j, x)\, dx$$
where $\hat{p}_D(\omega_j, x)$ is an estimate of the true distribution based on data $D$, and $\hat{\alpha}_D$ is an estimate of the model parameters.
$R_{est}(\hat{\alpha}_D)$ will be termed the Estimated Risk (not standard terminology).

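5 Interpreting Minimizing Training Error
Derive the Observed Risk from
$$R(\alpha) = \sum_{j=1}^{c} \int L(\hat{\omega}(x; \alpha) \mid \omega_j)\, p(\omega_j, x)\, dx$$
Replace the probability distribution with the sample, where $\omega_i$ is the label of sample $x_i$:
$$p(\omega_j, x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i)\, \delta(\omega_j - \omega_i)$$
Substituting in above (after integrating and summing):
$$R_{emp}(\alpha) = R_{obs}(\alpha) = \frac{1}{n} \sum_{i=1}^{n} L(\hat{\omega}(x_i; \alpha) \mid \omega_i)$$
Observed Risk for 0-1 Loss is Error Rate. When the loss is making an error:
$$L(\hat{\omega}(x_i; \alpha) \mid \omega_i) = 1 - \delta(\hat{\omega}(x_i; \alpha) - \omega_i) = \begin{cases} 0 & \text{if } \hat{\omega}(x_i; \alpha) = \omega_i \\ 1 & \text{else} \end{cases}$$
$$R_{obs}(\alpha) = \frac{1}{n} \sum_{i=1}^{n} \big[1 - \delta(\hat{\omega}(x_i; \alpha) - \omega_i)\big] = \#\text{errors} / \text{Total}$$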
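The observed risk can be evaluated for several parameter values and the smallest one picked, which is what minimizing training error amounts to. The sketch below assumes a hypothetical one-dimensional threshold rule for g(x; α) and a small made-up sample; neither comes from the slides.

```python
def empirical_risk(classifier, samples, loss):
    """R_emp: mean loss of the classifier over (x, label) samples."""
    return sum(loss(classifier(x), label) for x, label in samples) / len(samples)


def zero_one_loss(predicted, actual):
    """0-1 loss: 1.0 for an error, 0.0 for a correct decision."""
    return 0.0 if predicted == actual else 1.0


def threshold_classifier(alpha):
    """Hypothetical g(x; alpha): predict class 1 when x exceeds alpha."""
    return lambda x: 1 if x > alpha else 0


# Made-up labeled sample (x_i, w_i).
samples = [(0.1, 0), (0.4, 0), (0.5, 1), (0.7, 0), (0.9, 1)]

# Observed risk for three candidate parameter values.
risks = {
    alpha: empirical_risk(threshold_classifier(alpha), samples, zero_one_loss)
    for alpha in (0.0, 0.45, 1.0)
}
best_alpha = min(risks, key=risks.get)  # parameter with the fewest training errors
```

With 0-1 loss each risk value is simply the fraction of misclassified training samples for that α.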

6 Implication: Minimizing the training error is equivalent to modeling $p(x, \omega)$ as a sum of delta functions
Data is sampled from some unknown distribution.
Samples form the empirical distribution
$$F(x) = \frac{1}{n} \sum_{i=1}^{n} \delta(x - x_i)$$
[Figure: plots labeled P(x) and F(x) against x]

7 Need better estimator for the Empirical Risk
Improvement comes from the way we use the data. Depends implicitly on:
1. Parameter Estimate: approximation vs. estimation error
2. Probability Density Estimate
Resampling statistics: methods to provide better estimates of the actual Risk by resampling from data.
  Bootstrap
  Jackknife
  Cross-validation
Theoretical bounds: based on better approximations to the true data distribution.
  Support vector machines

8 Improving Parameter Estimates
Goal: Parameter estimates with low
  Estimation error (VARIANCE): How far is the estimate from minimizing the Empirical Risk?
  Approximation error (BIAS): How does the model constrain the best estimate from optimizing the true Risk?

10 Bias-Variance Trade-off
For Regression (fitting continuous functions to data), there is a well-known trade-off in fit quality.
Bias: For models with few parameters, there is a large approximation error. The true function is less likely to be in the set achievable via the model.
Variance: For models with many parameters, there is a large estimation error. Test error variance across test samples will be finite.
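The trade-off can be illustrated with a small simulation. This is a sketch under stated assumptions: the true function, the noise level, and the two stand-in models (a constant fit for "few parameters", a 1-nearest-neighbor fit for "many parameters") are all hypothetical choices, not taken from the slides.

```python
import random


def true_function(x):
    """Assumed ground truth for the simulation."""
    return x


def sample_training_set(rng, n, noise_sd):
    """Draw n noisy observations of the true function on [0, 1)."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_function(x) + rng.gauss(0.0, noise_sd)) for x in xs]


def fit_constant(data):
    """Rigid model: always predict the mean training response."""
    mean_y = sum(y for _, y in data) / len(data)
    return lambda x: mean_y


def fit_nearest_neighbor(data):
    """Flexible model: predict the response of the closest training point."""
    return lambda x: min(data, key=lambda pair: abs(pair[0] - x))[1]


def bias_squared_and_variance(fit, x0, trials=500, n=20, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance of the prediction at x0 over many training sets."""
    rng = random.Random(seed)
    predictions = [fit(sample_training_set(rng, n, noise_sd))(x0) for _ in range(trials)]
    mean_prediction = sum(predictions) / trials
    bias_sq = (mean_prediction - true_function(x0)) ** 2
    variance = sum((p - mean_prediction) ** 2 for p in predictions) / trials
    return bias_sq, variance


constant_bias_sq, constant_variance = bias_squared_and_variance(fit_constant, x0=0.9)
neighbor_bias_sq, neighbor_variance = bias_squared_and_variance(fit_nearest_neighbor, x0=0.9)
```

In this setup the constant fit should show the larger squared bias and the nearest-neighbor fit the larger variance, matching the two bullet points above.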

11 [Figure with axis labels P(error) and Error]

12 Find best balance between the two

14 Bias-Variance for Classifiers

15 Bias-Variance Decomp. for Classifiers
Dichotomy not as clear for classifiers as regression. Topic of current research.
Recent result: for training sets $D = \{D_1, \ldots, D_M\}$
Main Estimate (est. based on many data sets): $\hat{\omega}_m(x_i) = \arg\min_{\omega'} E_D[L(\hat{\omega}(x_i), \omega')]$
Bias (error from main to best estimate): $B(x_i) = L(\hat{\omega}_m(x_i), \omega^*_{best}(x_i))$
Variance (error from single data set est. to main): $V(x_i) = E_D[L(\hat{\omega}_m(x_i), \hat{\omega}(x_i))]$
Noise (errors you can't avoid): $N(x_i) = E_{\omega}[L(\omega^*_{best}(x_i), \omega_{true})]$
Error Decomp: $E[\mathrm{Error}(x_i)] = c_1 N(x_i) + B(x_i) + c_2 V(x_i)$
where $\omega^*_{best}(x_i)$ is the Optimal Prediction.
Domingos, P. (2000). A Unified Bias-Variance Decomposition for Zero-One and Squared Loss. Proceedings AAAI.
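For 0-1 loss the main estimate is the most frequent prediction across training sets, so the quantities on this slide can be computed by counting. The sketch below assumes two classes and no noise, in which case c_2 is +1 at an unbiased point and -1 at a biased one; the prediction list is hypothetical.

```python
from collections import Counter


def main_prediction(predictions):
    """Under 0-1 loss the main estimate is the most frequent predicted label."""
    return Counter(predictions).most_common(1)[0][0]


def bias_and_variance(predictions, optimal_label):
    """Return (main estimate, bias B, variance V) at a single point for 0-1 loss."""
    main = main_prediction(predictions)
    bias = 0 if main == optimal_label else 1
    variance = sum(1 for p in predictions if p != main) / len(predictions)
    return main, bias, variance


# Hypothetical predictions at one point x_i from classifiers trained on M = 5 data sets.
predictions = [1, 1, 0, 1, 0]
optimal_label = 1
main, bias, variance = bias_and_variance(predictions, optimal_label)

# Average 0-1 loss against the optimal label, for comparison with B + c_2 * V.
expected_loss = sum(1 for p in predictions if p != optimal_label) / len(predictions)
c2 = 1 - 2 * bias  # +1 when unbiased, -1 when biased (two classes, no noise assumed)
```

Here the point is unbiased, so the expected loss equals the variance alone; at a biased point, variance would reduce the error instead of adding to it.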

16 Schematic of bias and variance. The model space is the set of all possible predictions from the model, with the "closest fit" labeled with a black dot. The model bias from the truth is shown, along with the variance, indicated by the large yellow circle centered at the black dot labeled "closest fit in population". A shrunken or regularized fit is also shown, having more estimation bias, but smaller prediction error due to its decreased variance.

17 Solutions

18 Roadmap
Methods to estimate generalization error from a training sample.
Methods for improving classifier performance:
  Choosing the classifier with minimal generalization error rate
  Committee decisions (voting)
  Averaging classifiers
  Bagging
  Boosting

19 Cross-Validation
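Only the title of this slide survives, so the following is a generic sketch of k-fold cross-validation, not a reconstruction of the slide. It assumes contiguous folds, a hypothetical nearest-class-mean learner on one feature, and a made-up sample.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    folds, start = [], 0
    for fold in range(k):
        size = n // k + (1 if fold < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_nearest_mean(samples):
    """Hypothetical 1-D learner: assign x to the class with the closest mean."""
    sums, counts = {}, {}
    for x, label in samples:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    means = {label: sums[label] / counts[label] for label in sums}
    return lambda x: min(means, key=lambda label: abs(x - means[label]))


def cross_validated_error(samples, k, train):
    """Train on k-1 folds, count errors on the held-out fold, and average over all samples."""
    errors = 0
    for held_out in k_fold_indices(len(samples), k):
        held = set(held_out)
        training = [s for i, s in enumerate(samples) if i not in held]
        classifier = train(training)
        errors += sum(1 for i in held_out if classifier(samples[i][0]) != samples[i][1])
    return errors / len(samples)


# Made-up sample with interleaved classes so every training split contains both labels.
samples = [(0.0, 0), (1.0, 1), (0.2, 0), (0.9, 1), (0.1, 0), (1.1, 1), (0.6, 0), (0.8, 1)]
cv_error = cross_validated_error(samples, k=4, train=train_nearest_mean)
```

Every sample is scored exactly once by a classifier that never saw it, which is what makes the average an estimate of generalization error instead of training error.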

20 Bootstrap
[Figure: a bootstrap sample drawn from the data, e.g. $x_5, x_3, x_2, x_9, x_3$]
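The sample shown on this slide repeats one point, which is the signature of drawing with replacement. A minimal sketch of that resampling step, used here to estimate the standard error of an error rate, is given below; the error indicators, the replicate count, and the seed are hypothetical.

```python
import random


def bootstrap_resample(data, rng):
    """Draw len(data) items from data uniformly at random, with replacement."""
    return [data[rng.randrange(len(data))] for _ in data]


def bootstrap_standard_error(data, statistic, replicates=1000, seed=0):
    """Standard deviation of the statistic across bootstrap resamples."""
    rng = random.Random(seed)
    values = [statistic(bootstrap_resample(data, rng)) for _ in range(replicates)]
    mean_value = sum(values) / replicates
    variance = sum((v - mean_value) ** 2 for v in values) / (replicates - 1)
    return variance ** 0.5


def mean(sample):
    return sum(sample) / len(sample)


# Hypothetical per-sample error indicators (1 = misclassified) for 10 points.
errors = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
error_rate = mean(errors)
standard_error = bootstrap_standard_error(errors, mean)
```

The same resampling function can feed any statistic, including a classifier trained on each resample.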

21 Bootstrap

23 Bagging (Bootstrap Aggregation)
1. Choose BEST
2. Committee vote
3. Average
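The committee-vote option can be sketched as follows: train one classifier per bootstrap resample and let the majority decide. The threshold "stump" learner, the data, the number of rounds, and the seed are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def train_stump(samples):
    """Hypothetical weak learner: best 1-D rule of the form 'predict 1 when x > t'."""
    best_threshold, best_errors = None, None
    for threshold, _ in samples:
        errors = sum(1 for x, label in samples if (1 if x > threshold else 0) != label)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return lambda x, t=best_threshold: 1 if x > t else 0


def bagged_classifier(samples, rounds=25, seed=0):
    """Train one stump per bootstrap resample and combine them by committee vote."""
    rng = random.Random(seed)
    committee = []
    for _ in range(rounds):
        resample = [samples[rng.randrange(len(samples))] for _ in samples]
        committee.append(train_stump(resample))

    def vote(x):
        # Majority label among the committee members.
        return Counter(member(x) for member in committee).most_common(1)[0][0]

    return vote


# Made-up 1-D training set: class 0 on the left, class 1 on the right.
samples = [(0.1, 0), (0.2, 0), (0.3, 0), (0.4, 0), (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]
vote = bagged_classifier(samples)
```

Replacing the majority vote with a mean of real-valued outputs would give the "Average" option instead.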

25 Combining Classifiers
Goal: generate a set of simple weak classification methods and combine them into a single strong method.
Solution: Average multiple classifiers, weighted by an estimate of their reliability. Combine the discriminant functions additively so that the final classifier is the sign of
$$\hat{g}(x) = \alpha_1 h(x; \hat{w}_1) + \cdots + \alpha_m h(x; \hat{w}_m)$$
where the votes $\alpha_i$ emphasize component classifiers that make more reliable predictions than others.
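A minimal sketch of this additive combination, assuming labels in {-1, +1}: the three component rules and their votes below are hypothetical, chosen so that one reliable classifier outweighs two less reliable ones.

```python
def combined_discriminant(x, classifiers, votes):
    """g_hat(x): sum of component outputs, each scaled by its vote alpha."""
    return sum(alpha * h(x) for alpha, h in zip(votes, classifiers))


def combined_label(x, classifiers, votes):
    """Final decision: the sign of the combined discriminant (ties go to +1)."""
    return 1 if combined_discriminant(x, classifiers, votes) >= 0 else -1


# Hypothetical component classifiers on one feature, each returning +1 or -1.
classifiers = [
    lambda x: 1 if x > 0.2 else -1,
    lambda x: 1 if x > 0.5 else -1,
    lambda x: 1 if x > 0.7 else -1,
]
votes = [0.9, 0.3, 0.3]  # assumed reliabilities; the first classifier is trusted most

x = 0.3
score = combined_discriminant(x, classifiers, votes)  # 0.9 - 0.3 - 0.3
label = combined_label(x, classifiers, votes)
unweighted_majority = 1 if sum(h(x) for h in classifiers) >= 0 else -1
```

At x = 0.3 the weighted decision is +1 while a plain majority of the three rules would say -1, which shows how the votes change the outcome.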

26 Adaboost

27 General Algorithm

28 Adaboost Algorithm
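The algorithm text for this slide is not preserved. As a stand-in, here is a sketch of the commonly presented discrete AdaBoost procedure with labels in {-1, +1}; it may differ in detail from what the slide showed. The weighted stump learner and the ten-point data set are hypothetical.

```python
import math


def train_weighted_stump(xs, ys, weights):
    """Hypothetical weak learner: 1-D threshold rule with minimal weighted error."""
    best = None
    for threshold in xs:
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if (polarity if x > threshold else -polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    error, threshold, polarity = best
    return error, (lambda x, t=threshold, p=polarity: p if x > t else -p)


def adaboost(xs, ys, rounds):
    """Return the sign of a vote-weighted sum of stumps fit to reweighted data."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        error, stump = train_weighted_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - error) / error)  # vote of this stump
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified samples, lower the rest, then renormalize.
        weights = [w * math.exp(-alpha * y * stump(x)) for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1


# Made-up data that no single threshold rule can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
one_round = adaboost(xs, ys, rounds=1)
three_rounds = adaboost(xs, ys, rounds=3)
training_errors_one = sum(1 for x, y in zip(xs, ys) if one_round(x) != y)
training_errors_three = sum(1 for x, y in zip(xs, ys) if three_rounds(x) != y)
```

On this data a single stump leaves three training errors, and three boosted stumps classify every training point correctly.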

29 Adaboost Example

32 Adaboost Summary
