Combining Estimators to Improve Performance


1 Combining Estimators to Improve Performance. A survey of model bundling techniques -- from boosting and bagging to Bayesian model averaging -- that are creating a breakthrough in the practice of data mining. John F. Elder IV, Ph.D., Elder Research, Charlottesville, Virginia. Greg Ridgeway, Ph.D., University of Washington, Dept. of Statistics.

2 Outline
- Why combine? A motivating example
- Hidden dangers of model selection
- Reducing modeling uncertainty through Bayesian Model Averaging
- Stabilizing predictors through bagging
- Improving performance through boosting
- Emerging theory illuminates empirical success
- Bundling, in general
- Latest algorithms
- Closing examples & summary

3 Reasons to combine estimators
- Decreases variability in the predictions.
- Accounts for uncertainty in the model class.
=> Improved accuracy on new data.

4 A Motivating Example: Classifying a bat's species from its chirp. Goal: Use time-frequency features of echolocation signals to classify bats by species in the field (avoiding capture and physical inspection). U. Illinois biologists gathered data: 98 signals from 19 bats representing 6 species: Southeastern, Grey, Little Brown, Indiana, Pipistrelle, Big-Eared. ~35 data features (dimensions) were calculated from the signals, such as the low frequency at the 3 dB level, the time position of the signal peak, and the amplitude ratio of the 1st and 2nd harmonics. The problem turned out to have a nice level of difficulty for comparing methods: overlap in classes, but some separability.

5 Sample Projection [scatterplot of the bat data projected onto the features Var4, t10, and bw]

6 Percent Classified Correctly [bar chart comparing accuracy and training time on the bat data for decision trees (TX2step, CART), polynomial networks (ASPN), neural networks with 8, 17, and 35 inputs, nearest-neighbor variants, and bundled combinations; training times range from seconds to about a day; bundling four of the older models via advisor perceptrons is highlighted]

7 What is model uncertainty? Suppose we wish to predict y from predictors x. Given a dataset of observations, D, for a new observation with predictors x* we want to derive the predictive distribution of y* given x* and D: P(y* | x*, D).

8 In practice. We want to use all the information in D to make the best estimate of y* for an individual with covariates x*, that is, P(y* | x*, D). In practice, however, we always use P(y* | x*, M), where M is a single model constructed from D.

9 Selecting M. The process of selecting a model usually involves:
- Model class selection (linear regression, tree regression, neural network)
- Variable selection (variable exclusion, transformation, smoothing)
- Parameter estimation
We then tend to treat the one model that fits the data or performs best as "the" model.

10 What's wrong with that? Two models may fit a dataset equally well (with respect to some loss) yet make different predictions. Competing interpretable models with equivalent performance offer ambiguous conclusions. Model search dilutes the evidence: part of the evidence is spent specifying the model.

11 Bayesian Model Averaging
Goal: Account for model uncertainty.
Method: Use Bayes' theorem and average the models, weighted by their posterior probabilities.
Properties: Improves predictive performance; theoretically elegant; computationally costly.

12 Averaging the models. Consider a set containing the K candidate models M_1, ..., M_K. With a few probability manipulations we can make predictions using all of them: P(y* | x*, D) = Σ_k P(y* | x*, M_k) P(M_k | D). The probability mass for a particular prediction value of y is a weighted average of the probability mass that each model places on that value of y, where the weight is the posterior probability of that model given the data.

13 Bayes' Theorem. With M_k a model, D the data, P(D | M_k) the integrated likelihood of M_k, and P(M_k) the prior model probability: P(M_k | D) = P(D | M_k) P(M_k) / Σ_{l=1..K} P(D | M_l) P(M_l).

14 Challenges. The size of the model set may make exhaustive summation impossible. The integrated likelihood of each model is usually complex. Specifying a prior distribution (even a noninformative one) across the space of models is non-trivial. Proposed solutions to these challenges often involve MCMC, the BIC approximation, the MLE approximation, Occam's window, and Occam's razor.
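
To make the BIC approximation concrete, here is a minimal sketch (not from the tutorial; plain NumPy, with hypothetical function names) that averages linear-regression models over small variable subsets, using exp(-BIC/2) as a stand-in for the posterior model probability P(M_k | D):

```python
import numpy as np
from itertools import combinations

def fit_ols(X, y):
    """Least-squares fit; returns coefficients and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return beta, rss

def bma_predict(X, y, X_new, max_vars=2):
    """Approximate BMA over all variable subsets of size <= max_vars, weighted by BIC."""
    n, p = X.shape
    subsets = [list(s) for k in range(1, max_vars + 1) for s in combinations(range(p), k)]
    bics, preds = [], []
    for s in subsets:
        Xs = np.column_stack([np.ones(n), X[:, s]])
        beta, rss = fit_ols(Xs, y)
        k = Xs.shape[1]
        bics.append(n * np.log(rss / n) + k * np.log(n))   # Gaussian-likelihood BIC
        Xn = np.column_stack([np.ones(len(X_new)), X_new[:, s]])
        preds.append(Xn @ beta)
    bics = np.asarray(bics)
    w = np.exp(-0.5 * (bics - bics.min()))   # P(M_k | D) roughly proportional to exp(-BIC_k / 2)
    w /= w.sum()
    return np.sum(w[:, None] * np.asarray(preds), axis=0)
```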

15 Performance
- Survival model (primary biliary cirrhosis): BMA vs. stepwise regression, 2% improvement; BMA vs. expert-selected model, 10% improvement.
- Linear regression (body fat prediction): BMA provides the best 90% predictive coverage.
- Graphical models: BMA yields an improvement.

16 BMA References
- Chris Volinsky's BMA homepage.
- Hoeting, J., D. Madigan, A. Raftery, and C. Volinsky (1999). Bayesian Model Averaging: A Tutorial. Statistical Science.

17 Unstable predictors. We can always assume y = f(x) + ε, where E(ε | x) = 0. Assume that we have a way of constructing a predictor, f̂_D(x), from a dataset D. We want to choose the estimator of f that minimizes a loss J, for example squared loss: J(f̂, D) = E_{y,x} (y - f̂_D(x))^2.

18 Bias-variance decomposition. If we could average over all possible datasets, let the average prediction be f̄(x) = E_D f̂_D(x). The average prediction error over all datasets that we might see decomposes as E_D J(f̂, D) = E ε^2 + E_x (f̄(x) - f(x))^2 + E_{x,D} (f̂_D(x) - f̄(x))^2 = noise + bias^2 + variance.

19 Bias-variance decomposition (cont.) E_D J(f̂, D) = E ε^2 + E_x (f̄(x) - f(x))^2 + E_{x,D} (f̂_D(x) - f̄(x))^2 = noise + bias^2 + variance. The noise cannot be reduced. The squared-bias term might be reducible. The variance term is 0 if we use f̂_D(x) = f̄(x), but this requires having an infinite number of datasets.
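
A small Monte Carlo experiment (a hypothetical illustration, not part of the tutorial) makes the decomposition tangible: simulate many datasets from y = f(x) + ε, fit a model to each, and estimate the three terms on a grid of x values.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)      # true regression function
sigma = 0.3                              # noise standard deviation
x_grid = np.linspace(0, 1, 50)

def fit_one_dataset(degree=5, n=30):
    """Simulate one dataset D, fit a polynomial, return its predictions on the grid."""
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    coef = np.polyfit(x, y, degree)
    return np.polyval(coef, x_grid)

preds = np.array([fit_one_dataset() for _ in range(500)])  # 500 simulated datasets
f_bar = preds.mean(axis=0)                   # average prediction f_bar(x) = E_D f_hat_D(x)
bias2 = np.mean((f_bar - f(x_grid)) ** 2)    # E_x (f_bar(x) - f(x))^2
variance = np.mean(preds.var(axis=0))        # E_{x,D} (f_hat_D(x) - f_bar(x))^2
print("noise:", sigma ** 2, "bias^2:", bias2, "variance:", variance)
```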

20 Bagging (Bootstrap Aggregating)
Goal: Variance reduction.
Method: Create bootstrap replicates of the dataset and fit a model to each; average the predictions of the models.
Properties: Stabilizes unstable methods; easy to implement and parallelizable; theory is not fully explained.

21 Bagging algorithm
1. Create K bootstrap replicates of the dataset.
2. Fit a model to each of the replicates.
3. Average (or vote) the predictions of the K models.
Bootstrapping simulates the stream of datasets averaged over in the bias-variance decomposition.
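
A minimal sketch of these three steps (assuming scikit-learn regression trees as the unstable base learner; any fitting routine could be swapped in):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bag_trees(X, y, K=100, seed=0):
    """Fit K trees, each to a bootstrap replicate of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(K):
        idx = rng.integers(0, n, n)                    # sample n rows with replacement
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X_new):
    """Average the predictions of the bagged trees."""
    return np.mean([t.predict(X_new) for t in trees], axis=0)
```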

22 Bagging Example [scatterplot of a two-class dataset in features x1 and x2]

23 CART decision boundary [figure]

24 100 bagged trees [figure]

25 Bagged tree decision boundary [figure]

26 Regression results [table of squared-error loss for CART vs. bagged CART on the Boston Housing, Ozone, and Friedman #1-#3 benchmarks; the numeric entries were lost in transcription]

27 Classification results [table of misclassification rates for CART vs. bagged CART on the Diabetes, Breast, Ionosphere, Heart, Soybean, Glass, and Waveform datasets; the numeric entries were lost in transcription]

28 Bagging References
- Leo Breiman's homepage.
- Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2), 123-140.
- Friedman, J. and P. Hall (1999). On Bagging and Nonlinear Estimation.

29 Boosting
Goal: Improve misclassification rates.
Method: Sequentially fit models, each weighting more heavily those observations poorly predicted by the previous model.
Properties: Bias and variance reduction; easy to implement; theory is not fully (but almost) explained.

30 Origin of Boosting. Classification problems: {y, x}_i, i = 1, ..., n, with y ∈ {0, 1}. The task: construct a function F(x): x → {0, 1} so that F minimizes misclassification error.

31 Generic boosting algorithm
Equally weight the observations (y, x)_i.
For t in 1, ..., T:
- Using the weights, fit a classifier f_t(x) to y.
- Upweight the poorly predicted observations; downweight the well-predicted observations.
Merge f_1, ..., f_T to form the boosted classifier.

32 Real AdaBoost (Schapire & Singer 1998). Start with y_i ∈ {-1, 1} and w_i = 1/N.
For t in 1, ..., T do:
1. Estimate P_w(y = 1 | x) using the weights.
2. Set f_t(x) = (1/2) log [ P̂_w(y = 1 | x) / P̂_w(y = -1 | x) ].
3. Set w_i ← w_i exp(-y_i f_t(x_i)) and renormalize.
Output the classifier F(x) = sign( Σ_t f_t(x) ).
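
A sketch of these steps (assuming a shallow scikit-learn tree as the weighted class-probability estimator; the probability clipping is a practical guard not shown on the slide):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def real_adaboost(X, y, T=50, eps=1e-6):
    """Real AdaBoost sketch: y must be coded in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                        # w_i = 1/N
    stages = []
    for _ in range(T):
        base = DecisionTreeClassifier(max_depth=2)
        base.fit(X, y, sample_weight=w)            # estimate P_w(y = 1 | x)
        pos = list(base.classes_).index(1)
        p = np.clip(base.predict_proba(X)[:, pos], eps, 1 - eps)
        f = 0.5 * np.log(p / (1 - p))              # f_t(x) = 1/2 log[P_w(y=1|x) / P_w(y=-1|x)]
        w = w * np.exp(-y * f)                     # reweight ...
        w /= w.sum()                               # ... and renormalize
        stages.append(base)
    return stages

def adaboost_predict(stages, X_new, eps=1e-6):
    """F(x) = sign( sum_t f_t(x) )."""
    F = np.zeros(len(X_new))
    for base in stages:
        pos = list(base.classes_).index(1)
        p = np.clip(base.predict_proba(X_new)[:, pos], eps, 1 - eps)
        F += 0.5 * np.log(p / (1 - p))
    return np.sign(F)
```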

33 AdaBoost's Performance (Freund & Schapire [1996]). Leo Breiman: "AdaBoost with trees is the best off-the-shelf classifier in the world." It performs well with many base classifiers and in a variety of problem domains. AdaBoost is generally slow to overfit. Boosted naïve Bayes tied for first place in the 1997 KDD Cup (Elkan [1997]). Boosted naïve Bayes is a scalable, interpretable classifier (Ridgeway et al. [1998]).

34 Boosting Example [scatterplot of a two-class dataset in features x1 and x2]

35 After one iteration [plot of the first CART splits; larger points have greater weight]

36 After 3 iterations [plot]

37 After 20 iterations [plot]

38 Decision boundary after 100 iterations [figure]

39 Boosting as optimization. Friedman, Hastie, and Tibshirani [1998]: AdaBoost is an optimization method for finding a classifier. Let y ∈ {-1, 1} and F(x) ∈ (-∞, ∞), and consider the criterion J(F) = E( e^{-yF(x)} | x ).

40 Criterion. E(e^{-yF(x)}) bounds the misclassification rate, since I(yF(x) < 0) ≤ e^{-yF(x)}. The minimizer of E(e^{-yF(x)}) coincides with the maximizer of the expected Bernoulli log-likelihood, E ℓ(p(x), y) = -E log(1 + e^{-2yF(x)}).

41 Optimization step. Select f to minimize J(F + f) = E( e^{-y(F(x) + f(x))} | x ). This yields the update F^{(t+1)}(x) = F^{(t)}(x) + (1/2) log [ E_w(I(y = 1) | x) / E_w(I(y = -1) | x) ], where the weights are w(x, y) = e^{-yF^{(t)}(x)}.

42 LogitBoost (Friedman, Hastie, Tibshirani [1998]). Logistic regression: y = 1 with probability p(x) and 0 with probability 1 - p(x), where p(x) = 1 / (1 + e^{-F(x)}). The expected log-likelihood of a regressor F(x) is E ℓ(F) = E( yF(x) - log(1 + e^{F(x)}) | x ).

43 Newton steps. Iterate to optimize the expected log-likelihood J(F + f) = E( y(F(x) + f(x)) - log(1 + e^{F(x) + f(x)}) | x ). Each Newton step updates F^{(t+1)}(x) = F^{(t)}(x) - [ ∂J(F^{(t)} + f)/∂f ] / [ ∂²J(F^{(t)} + f)/∂f² ], with the derivatives evaluated at f = 0.

44 LogitBoost, continued. The Newton step for the Bernoulli likelihood is F(x) ← F(x) + E_w( (y - p(x)) / (p(x)(1 - p(x))) | x ), with weights w(x) = p(x)(1 - p(x)). In practice the weighted regressor E_w( · | x) can be any regressor: trees, smoothers, etc. Trees are adaptive and work well for high-dimensional data.
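
A compact sketch of that Newton step (using a small scikit-learn regression tree for the weighted regression E_w; the depth, number of iterations, and weight clipping are hypothetical choices):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def logitboost(X, y01, T=50, eps=1e-5):
    """LogitBoost sketch; y01 coded in {0, 1}."""
    F = np.zeros(len(y01))
    stages = []
    for _ in range(T):
        p = 1.0 / (1.0 + np.exp(-F))               # p(x) = 1 / (1 + exp(-F(x)))
        w = np.clip(p * (1 - p), eps, None)        # Newton weights w(x) = p(x)(1 - p(x))
        z = (y01 - p) / w                          # working response (y - p) / (p(1 - p))
        tree = DecisionTreeRegressor(max_depth=2)
        tree.fit(X, z, sample_weight=w)            # weighted regression approximates E_w(z | x)
        F = F + tree.predict(X)                    # Newton update of F
        stages.append(tree)
    return stages

def logitboost_predict_proba(stages, X_new):
    F = np.sum([t.predict(X_new) for t in stages], axis=0)
    return 1.0 / (1.0 + np.exp(-F))
```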

45 Misclassification rates (Friedman, Hastie, Tibshirani [1998]) [table comparing CART, AdaBoost with CART, and LogitBoost with CART on the Breast, Ionosphere, Glass, Sonar, and Waveform datasets; the numeric entries were lost in transcription]

46 Boosting References
- Rob Schapire's homepage.
- Freund, Y. and R. Schapire (1996). Experiments with a New Boosting Algorithm. Machine Learning: Proceedings of the 13th International Conference.
- Jerry Friedman's homepage.
- Friedman, J., T. Hastie, and R. Tibshirani (1998). Additive Logistic Regression: A Statistical View of Boosting. Technical report, Statistics Department, Stanford University.

47 In general, combining ("bundling") estimators consists of two steps: 1) constructing varied models, and 2) combining their estimates.
Generate component models by varying: case weights, data values, guiding parameters, variable subsets.
Combine estimates using: estimator weights, voting, advisor perceptrons, partitions of the design space X.

48 Other Bundling Techniques
We've examined:
- Bayesian Model Averaging: sum estimates of the possible models, weighted by posterior evidence.
- Bagging (Breiman 96) (bootstrap aggregating): bootstrap the data (mostly to build trees); take a majority vote or average.
- Boosting (Freund & Schapire 96): weight error cases by β_t = (1 - e(t)) / e(t), iteratively re-model; average, weighting model t by ln(β_t).
Additional example techniques:
- GMDH (Ivakhnenko 68): multiple layers of quadratic polynomials, using two inputs each, fit by linear regression.
- Stacking (Wolpert 92): train a 2nd-level (linear regression) model using leave-one-out estimates of the 1st-level (neural net) models; see the sketch below.
- ARCing (Breiman 96) (Adaptive Resampling and Combining): bagging with reweighting of error cases; a superset of boosting.
- Bumping (Tibshirani 97): bootstrap, then select the single best model.
- Crumpling (Anderson & Elder 98): average cross-validations.
- Born-Again (Breiman 98): invent new X data.
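
A rough sketch of stacking (substituting cross-validated predictions for Wolpert's leave-one-out estimates, and simple scikit-learn learners for the neural nets mentioned above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor

def stacked_predict(X, y, X_new):
    """Stacking sketch: a 2nd-level linear model combines out-of-fold 1st-level predictions."""
    level1 = [DecisionTreeRegressor(max_depth=4), KNeighborsRegressor(n_neighbors=5)]
    # out-of-fold predictions stand in for the leave-one-out estimates
    Z = np.column_stack([cross_val_predict(m, X, y, cv=5) for m in level1])
    combiner = LinearRegression().fit(Z, y)        # 2nd-level model
    for m in level1:
        m.fit(X, y)                                # refit 1st-level models on all the data
    Z_new = np.column_stack([m.predict(X_new) for m in level1])
    return combiner.predict(Z_new)
```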

49 Group Method of Data Handling (GMDH) [network diagram: each node takes two inputs, e.g. a and b, and computes a quadratic polynomial A_0 + A_1 a + A_2 b + A_3 a^2 + A_4 ab + A_5 b^2; node outputs z_i feed later layers until the estimate y_est emerges]. Try all pairs of variables (K choose 2) in quadratic polynomial nodes. Fit the coefficients using regression. Keep the best M nodes. Train the model on one training data set and score it on a test data set. (A third data set is needed for independent confirmation of the model.)

50 Polynomial Networks (ASPN) [network diagram: Layer 0 normalizer nodes feed layers of Single, Double, and Triple polynomial nodes and a Multi-Linear node, ending in unitizer nodes that produce the outputs Y1 and Y2; an example triple node, z17, is a cubic polynomial in its three inputs a, b, and c]

51 When does Bundling work? Hypotheses:
- Breiman (1996): when the prediction method is unstable (significantly different models are constructed).
- Ali & Pazzani (1996): when there is low noise, many irrelevant variables, and good individual predictors that make different errors.
- When models are slightly overfit.
- When models are from different families.

52 Advanced techniques: stochastic gradient boosting; adaptive bagging; example regression and classification results.

53 Stochastic Gradient Boosting
Goal: Non-parametric function estimation.
Method: Cast the problem as optimization and use gradient ascent to obtain the predictor.
Properties: Bias and variance reduction; widely applicable; can make use of existing algorithms; many tuning parameters.

54 Improving boosting. Boosting usually has the form F^{(t+1)}(x) = F^{(t)}(x) + λ E_w( z(y, x) | x ). Improve by: sub-sampling a fraction of the data at each step when computing the expectation; robustifying the expectation; trimming observations with small weights.
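
A rough sketch of this update for squared-error loss, showing both the shrinkage λ and the per-step sub-sampling (hypothetical parameter values; trees via scikit-learn):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sgb_fit(X, y, T=200, lam=0.05, subsample=0.5, seed=0):
    """Stochastic gradient boosting sketch: F <- F + lam * E(z | x), where z is the
    negative gradient of squared-error loss (the residual), fit on a random subsample."""
    rng = np.random.default_rng(seed)
    n = len(y)
    init = y.mean()
    F = np.full(n, init)
    trees = []
    for _ in range(T):
        z = y - F                                   # negative gradient of squared error
        idx = rng.choice(n, size=int(subsample * n), replace=False)  # subsample each step
        tree = DecisionTreeRegressor(max_depth=3).fit(X[idx], z[idx])
        F += lam * tree.predict(X)                  # shrunken (lambda-scaled) update
        trees.append(tree)
    return init, trees, lam

def sgb_predict(init, trees, lam, X_new):
    return init + lam * np.sum([t.predict(X_new) for t in trees], axis=0)
```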

55 Stochastic gradient boosting offers...
- Application to likelihood-based models (GLMs, Cox models)
- Bias reduction (non-linear fitting)
- Massive datasets (bagging, trimming)
- Variance reduction (bagging)
- Interpretability (additive models)
- High-dimensional regression (trees)
- Robust regression

56 SGB References
- Friedman, J. (1999). Greedy Function Approximation: A Gradient Boosting Machine. Technical report, Dept. of Statistics, Stanford University.
- Friedman, J. (1999). Stochastic Gradient Boosting. Technical report, Dept. of Statistics, Stanford University.

57 Adaptive Bagging
Goal: Bias and variance reduction.
Method: Sequentially fit bagged models, where each fits the current residuals.
Properties: Bias and variance reduction; no tuning parameters.

58 Adaptive bagging algorithm
1. Fit a bagged regressor to the dataset D.
2. Predict the out-of-bag observations.
3. Fit a new bagged regressor to the bias (the out-of-bag errors) and repeat.
For a new observation, sum the predictions from each stage.
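
A sketch of those stages (bagged regression trees via scikit-learn; the handling of cases that never fall out-of-bag is a pragmatic assumption, not from the slide):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def adaptive_bagging(X, y, stages=3, K=50, seed=0):
    """Adaptive bagging sketch: each stage is a bagged regressor fit to the
    out-of-bag residuals left by the previous stage."""
    rng = np.random.default_rng(seed)
    n = len(y)
    target = y.astype(float).copy()
    all_stages = []
    for _ in range(stages):
        trees, oob_sum, oob_cnt = [], np.zeros(n), np.zeros(n)
        for _ in range(K):
            idx = rng.integers(0, n, n)
            oob = np.setdiff1d(np.arange(n), idx)     # cases not in this bootstrap sample
            t = DecisionTreeRegressor().fit(X[idx], target[idx])
            trees.append(t)
            oob_sum[oob] += t.predict(X[oob])
            oob_cnt[oob] += 1
        oob_pred = np.where(oob_cnt > 0, oob_sum / np.maximum(oob_cnt, 1), 0.0)
        all_stages.append(trees)
        target = target - oob_pred                    # next stage fits the remaining bias
    return all_stages

def adaptive_bagging_predict(all_stages, X_new):
    # sum the bagged predictions from each stage
    return sum(np.mean([t.predict(X_new) for t in trees], axis=0) for trees in all_stages)
```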

59 Regression results [table of squared-error loss for bagging vs. adaptive bagging on the Abalone, Robot Arm, Peak20, Boston Housing, Ozone, Servo, and Friedman #1-#3 benchmarks; the numeric entries were lost in transcription]

60 Classification results [table of misclassification rates for AdaBoost vs. adaptive bagging on the Breast, Ionosphere, Sonar, Heart, German Credit, Votes, Liver, and Diabetes datasets; the numeric entries were lost in transcription]

61 Relative Performance Examples: 5 Algorithms on 6 Datasets (John Elder, Elder Research & Stephen Lee, U. Idaho, 1997) [chart comparing neural networks, logistic regression, linear vector quantization, projection pursuit regression, and decision trees on the Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, and Investment datasets]

62 Essentially every bundling method improves performance [chart of relative error for the advisor perceptron, AP weighted average, vote, and average combinations on the Diabetes, Gaussian, Hypothyroid, German Credit, Waveform, and Investment datasets]

63 Application Example: Direct Marketing (Elder Research). Model respondents to direct marketing as a binary variable: 0 (no response), 1 (response). Create models using several (here, 5) different algorithms, all employing the same candidate model inputs. Rank-order the model responses: give the highest-probability response a rank of 1, the second highest a rank of 2, etc. For bundling, combine the model ranks (not the estimates) into a new consensus estimate (which is again ranked). Report the number of response cases missed (in the top portion of the ranked list).
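
The rank-combination step might look like this (a hypothetical helper in plain NumPy; the five model types above would supply the score lists):

```python
import numpy as np

def to_rank(scores):
    """Rank cases so the highest score (most likely responder) gets rank 1."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    ranks = np.empty(len(scores), dtype=int)
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks

def consensus_rank(score_lists):
    """Average each model's rank for every case, then rank the averages again."""
    avg_rank = np.mean([to_rank(s) for s in score_lists], axis=0)
    return to_rank(-avg_rank)      # smallest average rank becomes consensus rank 1
```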

64 Marketing Application Performance [chart: number of response cases missed (roughly 55-80) vs. number of models combined by averaging output rank; the single models are Stepwise regression (S), Polynomial network (P), Neural network (N), MARS (M), and Trees (T), and letter codes such as NT and SMPT denote their combinations]

65 Median (and mean) error is reduced with each stage of combination [chart: cases missed vs. number of models in the combination; fewer misses is better]

66 Why Bundling works
- (Semi-)independent estimators
- Bayes' rule: weighing evidence
- Shrinking (e.g., stepwise linear regression)
- Smoothing (e.g., decision trees)
- "...and in a multitude of counselors there is safety." (Proverbs 24:6b)
- Additive modeling and maximum likelihood (Friedman, Hastie, & Tibshirani, 8/20/98)
This is an open research area. Meanwhile, we recommend bundling competing candidate models both within, and between, model families.
