Lecture 12: Ensemble Methods. Isabelle Guyon


Lecture 12: Ensemble Methods (Isabelle Guyon). Introduction (book chapter 7).

Weighted Majority / Mixture of Experts (Committee). Assume K experts f_1, f_2, ..., f_K (base learners). Each expert makes a decision f_k(x) = ±1. Improve predictions by making the experts vote according to how good they are: F(x) = Σ_k α_k f_k(x), with α_k ≥ 0 and Σ_k α_k = 1. Decision: sign[F(x)].

[Diagram: the input components x_1, x_2, ..., x_n feed each expert f_1(x), f_2(x), ..., f_K(x); the expert outputs, weighted by α_1, α_2, ..., α_K, are combined by a summing unit Σ into F(x) = Σ_k α_k f_k(x).]
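
To make the combination rule concrete, here is a minimal NumPy sketch of the weighted vote F(x) = Σ_k α_k f_k(x) for experts that output ±1; the function name, the toy predictions, and the weights are illustrative, not something given in the lecture.

```python
import numpy as np

def weighted_majority(predictions, alphas):
    """Combine K experts f_1..f_K that each output +/-1.

    predictions: array of shape (K, n_samples) with entries in {-1, +1}
    alphas:      array of shape (K,) of non-negative weights summing to 1
    Returns sign(F(x)) and F(x) = sum_k alpha_k f_k(x).
    """
    F = alphas @ predictions          # shape (n_samples,)
    return np.sign(F), F

# Toy usage: three experts voting on three samples.
preds = np.array([[+1, -1, +1],
                  [+1, +1, -1],
                  [-1, +1, +1]])
alphas = np.array([0.6, 0.25, 0.15])
decision, scores = weighted_majority(preds, alphas)
print(decision, scores)   # decisions in {-1, +1}, scores = weighted sums
```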

Simple Examples. Kernel methods: each example is an expert, f_i(x) = y_i k(x, x_i). Decision stumps: each variable is an expert, f_j(x) = x_j (with the variables oriented so that they correlate positively with y). More generally, each feature is an expert, f_j(x) = φ_j(x).

Bias-variance tradeoff (salford-systems.com/doc/bias_variance_arcing.pdf). Let D be one training set of size m (m fixed). For the square loss, the expected value of the loss over datasets D of the same size decomposes as E_D[f(x,D) − y]² = [E_D f(x,D) − y]² + E_D[f(x,D) − E_D f(x,D)]², i.e. Bias² + Variance, where Bias² = [E_D f(x,D) − y]² and Variance = E_D[f(x,D) − E_D f(x,D)]².

[Diagram: the individual solutions f(x,D) scatter around their average E_D f(x,D) (variance); the average itself is offset from the target y (bias²).]

E_D f(x,D) is your ideal committee machine. Bias: what your ideal committee cannot learn (from m training examples); E_D f(x,D) has the same bias as f(x,D) but no variance. Note: each committee member was trained on a different set of m examples. Variance: how far, on average, your solution f(x,D) is from your ideal committee machine. If the variance is high but the bias is low, there is hope that a committee can improve performance. Note: subsampling introduces extra bias.
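
The decomposition above can be checked numerically. The sketch below (illustrative names and settings, not taken from the lecture) draws many training sets D of the same size m, fits the same low-capacity model on each, and estimates the bias² and variance of the predictions, with the average predictor E_D f(x,D) playing the role of the ideal committee machine.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)           # noiseless target y

def draw_training_set(m):
    x = rng.uniform(0, 1, m)
    y = target(x) + rng.normal(0, 0.3, m)  # noisy labels
    return x, y

m, n_sets, degree = 30, 500, 1             # many datasets D of the same size m
x_test = np.linspace(0, 1, 50)
preds = np.empty((n_sets, x_test.size))
for d in range(n_sets):
    x, y = draw_training_set(m)
    coeffs = np.polyfit(x, y, degree)      # low-capacity model: degree-1 polynomial
    preds[d] = np.polyval(coeffs, x_test)

mean_pred = preds.mean(axis=0)             # E_D f(x,D): the "ideal committee machine"
bias2 = np.mean((mean_pred - target(x_test)) ** 2)
variance = np.mean(preds.var(axis=0))
print(f"bias^2 = {bias2:.3f}  variance = {variance:.3f}")
```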

Feature Selection.
1) Mixture of decision stumps → feature selection.
2) Merging expert feature rankings. Average ranking index: C_j = Σ_k α_k C_jk. Average rank: C_j = Σ_k α_k (R_max − R_jk).
3) Merging the feature sets selected by the experts. Ranking index: C_j = Σ_k α_k δ_jk (δ_jk = 1 if feature j is selected by expert k, 0 otherwise). Most stable subset: S* = argmax_k min_k' |S_k ∩ S_k'|. Most stable ranking: R* = argmin_k max_k' dist(R_k, R_k') (…ables.html).
4) Sensitivity-based scoring (special for bagging).

Bayesian Approach: simple justification. The committee F(x) = Σ_k α_k f_k(x), with α_k ≥ 0 and Σ_k α_k = 1, mirrors P(y|x,D) = Σ_f P(f|D) P(y|x,f,D): the individual expert decisions are P(y|x,f,D) and the weights are P(f|D). Risk = negative log posterior: P(f|D) ∝ exp(−R_reg[f]/T) ∝ P(D|f) P(f) ∝ exp(−R_emp[f]/T) exp(−λ‖f‖²/T). Success rate: P(f|D) ∝ 1 − R_emp[f].

Bayesian methods. Discrete case: P(y|x,D) = Σ_f P(f|D) P(y|x,f,D). Continuous case: P(y|x,D) = ∫ P(f|D) P(y|x,f,D) df. MAP approximation: replace P(f|D) by a point mass at f* = argmax_f P(f|D), so that P(y|x,D) ≈ P(y|x,f*,D).
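
A small sketch of item 2), merging expert feature rankings with the average-rank score C_j = Σ_k α_k (R_max − R_jk); the array shapes and toy numbers are assumptions made for illustration.

```python
import numpy as np

def merge_rankings(ranks, alphas):
    """ranks[k, j] = rank R_jk given to feature j by expert k (1 = best);
    alphas = expert weights (non-negative, summing to 1).
    Returns C_j = sum_k alpha_k (R_max - R_jk): higher is better."""
    r_max = ranks.max()
    return alphas @ (r_max - ranks)

# Toy usage: 3 experts ranking 4 features.
ranks = np.array([[1, 2, 3, 4],
                  [2, 1, 4, 3],
                  [1, 3, 2, 4]])
alphas = np.array([0.5, 0.25, 0.25])
scores = merge_rankings(ranks, alphas)
print(scores, "-> features ordered by merged score:", np.argsort(-scores))
```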

Difficulties. Continuous case: there are infinitely many experts, we cannot try them all! Idea: let us take a sample. How? Grid, heuristic search, stochastic search. Important: avoid sampling poor experts or redundant experts.

Iterative Sampling. [Figure: PDF and CDF of the sampled experts as a function of the iteration, comparing an ideal sampler with uniform sampling.]

MCMC. P(f|D) ∝ exp(−R[f]/T). Simulated annealing: make a random step; accept it with probability exp(−ΔR[f]/T); progressively decrease T (Metropolis-Hastings, ...). Gibbs sampling: investigate a bunch of nearby solutions in f-space; sample according to the local sum of exp(−R[f]/T); start over from the new point (Geman & Geman, 1984).

Variable-dimension MCMC (Vehtari and Lampinen, 2002). Some steps include the removal or addition of a feature. We obtain P(model, feature subset | D) for some samples of models and feature subsets. Subset relevance can be computed by marginalization (averaging over the functions using the same subset). Feature relevance can also be computed by marginalization (averaging over all subsets containing that feature).
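
As one illustration of the stochastic-search idea, here is a Metropolis-style sampler with a simulated-annealing cooling schedule, taking the "experts" to be linear scorers; the risk function, step size, and cooling schedule are arbitrary choices for this sketch, not prescriptions from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def risk(w, X, y, lam=0.1):
    """Regularized empirical risk R_reg[f] for a linear scorer f(x) = <w, x>."""
    margins = y * (X @ w)
    return np.mean(np.log1p(np.exp(-margins))) + lam * np.dot(w, w)

def sample_experts(X, y, n_steps=2000, step=0.3, T0=1.0):
    """Random walk over experts w with acceptance rule exp(-dR/T), T decreasing.
    The accepted experts can later be combined into a committee."""
    w = np.zeros(X.shape[1])
    r = risk(w, X, y)
    accepted = []
    for t in range(n_steps):
        T = T0 / (1 + 0.01 * t)                          # cooling schedule
        w_new = w + step * rng.normal(size=w.shape)      # random step
        r_new = risk(w_new, X, y)
        if r_new <= r or rng.random() < np.exp(-(r_new - r) / T):
            w, r = w_new, r_new                          # accept
            accepted.append(w.copy())
    return accepted

# Toy usage on a noiseless 2-D problem.
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + X[:, 1])
experts = sample_experts(X, y)
print(len(experts), "experts kept")
```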

Performance Gain? If we draw M classifiers f_k according to P(f|D), we can approximate P(y|x,D) = ∫ P(f|D) P(y|x,f,D) df by P(y|x,D) ≈ (1/M) Σ_{k=1:M} P(y|x,f_k,D). The relative error difference with the optimum Bayes classifier decays as O(1/M) (Ng and Jordan, 2001).

Non-Bayesian approaches: parallel ensembles (bagging) and serial ensembles (boosting).

Bagging (Bootstrap Aggregating; Breiman, 1996). Draw, with replacement, m samples from the original training set of size m. Train a learning machine. Repeat many times. On average, each example appears in a given bootstrap sample with probability 1 − (1 − 1/m)^m ≈ 1 − e^−1 ≈ 0.632.

Random Forests (Breiman, 2001).
1. A number n is specified, much smaller than the total number N of variables (typically n ≈ sqrt(N)).
2. Each tree of maximum depth is grown on a bootstrap sample of the training set.
3. At each node, n variables are selected at random out of the N.
4. The split used is the best split on these n variables.
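
A minimal bagging sketch, assuming scikit-learn is available for the base trees; it also reports the fraction of distinct training examples per bootstrap sample, which should be close to the 0.632 figure above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def bagging_fit(X, y, n_estimators=25):
    """Bootstrap AGGregatING: one tree per bootstrap sample of size m, drawn
    with replacement from the m original training examples."""
    m = len(X)
    trees, distinct = [], []
    for _ in range(n_estimators):
        idx = rng.integers(0, m, size=m)             # m draws with replacement
        distinct.append(len(np.unique(idx)) / m)     # ~ 1 - 1/e ~ 0.632
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    print("average fraction of distinct examples per bag:",
          round(float(np.mean(distinct)), 3))
    return trees

def bagging_predict(trees, X):
    """Unweighted majority vote of the bagged trees (labels assumed in {-1, +1})."""
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return np.sign(votes)

# Usage sketch (X, y are numpy arrays with y in {-1, +1}):
#   trees = bagging_fit(X_train, y_train)
#   y_hat = bagging_predict(trees, X_test)
```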

Tree Classifiers. CART (Breiman, 1984) or C4.5 (Quinlan, 1993). At each step, choose the feature that reduces the entropy most; work towards node purity.

Information Gain (worked example). [Diagram: all the data at the root node; candidate splits on f_1 or f_2.] With 19 examples at the root (11 of one class, 8 of the other): H_before = −11/19 log(11/19) − 8/19 log(8/19) ≈ 0.98. Splitting on f_2 gives a left node of 11 examples with H_left = −4/11 log(4/11) − 7/11 log(7/11) ≈ 0.94 and a right node of 8 examples with H_right = −7/8 log(7/8) − 1/8 log(1/8) ≈ 0.54. IG = H_before − (11/19 H_left + 8/19 H_right) ≈ 0.98 − 0.78 ≈ 0.2.

Embedded Variable Scoring (illustrated on the Iris data). IG_t(f) = information gain due to splitting with feature f at node t. Ranking index: R(f) = Σ_t IG_t(f). Surrogate variables (detect masking). Using M trees: R(f) = Σ_T Σ_{t∈T} IG_t(f).
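
The information-gain numbers above can be reproduced in a few lines (entropy in base 2, as on the slide); only the class counts are needed.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (base 2) of a node with the given class counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, children):
    """IG = H(parent) - sum over children of (n_child / n_total) * H(child)."""
    n = sum(sum(c) for c in children)
    return entropy(parent) - sum(sum(c) / n * entropy(c) for c in children)

print(round(entropy([11, 8]), 2))                              # ~0.98
print(round(entropy([4, 7]), 2), round(entropy([7, 1]), 2))    # ~0.95, ~0.54
print(round(information_gain([11, 8], [[4, 7], [7, 1]]), 2))   # ~0.21
```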

Sensitivity-based Scoring (Breiman, 2001). Classify the out-of-bag (OOB) cases and count the number of votes cast for the correct class in every tree grown in the forest. Randomly permute the values of feature f in the OOB cases and classify these cases down the trees. Subtract the number of votes for the correct class in the feature-f-permuted OOB data from the number in the untouched OOB data. Average this number over all trees in the forest to obtain the importance score R(f).

Cross-validated Committee (Parmanto et al., 1996). Any learning machine; any method of splitting the (training) data many times into a training set and a validation set (vset). Perturb feature f randomly in the validation set (pvset). R(f) = mean[num-correct-class(vset) − num-correct-class(pvset)]. Z-score = R(f)/stderror.

Boosting. AdaBoost (Freund and Schapire, 1996): at every step, add a new base learner that is forced (by re-weighting the training data) to concentrate on misclassified examples. Forward stagewise boosting (Breiman, 1997; Friedman et al., 2000):
1. Initialize F(x) = 0.
2. For k = 1 to M: F(x) ← F(x) + α_k f_k(x), where (α_k, f_k) = argmin_{α,f} Σ_i exp(−y_i [F(x_i) + α f(x_i)]).
3. Output F(x) = Σ_{k=1:M} α_k f_k(x).

[Figure: loss functions L(y, f(x)) plotted against the margin z = y f(x): 0/1 loss; AdaBoost loss e^(−z); logistic loss log(1 + e^(−z)); perceptron loss max(0, −z); SVC loss max(0, 1 − z) for β = 1 and max(0, 1 − z)² for β = 2; square loss (1 − z)². The decision boundary is at z = 0; examples with z < 0 are misclassified, examples with z > 0 are well classified.]
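
A generic sketch of the permutation-based sensitivity score described above, usable with any fitted model that exposes a predict method (the function name and repeat count are illustrative); dividing the mean drop in correct predictions by its standard error gives the z-score used by the cross-validated committee.

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_scores(model, X_val, y_val, n_repeats=10):
    """R(f) = mean drop in the number of correct predictions when feature f is
    randomly permuted in the validation data; also returns R(f)/stderror."""
    base_correct = np.sum(model.predict(X_val) == y_val)
    n_features = X_val.shape[1]
    scores, zscores = np.zeros(n_features), np.zeros(n_features)
    for j in range(n_features):
        drops = []
        for _ in range(n_repeats):
            X_perm = X_val.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])   # perturb feature j only
            drops.append(base_correct - np.sum(model.predict(X_perm) == y_val))
        mu = float(np.mean(drops))
        se = float(np.std(drops)) / np.sqrt(n_repeats)
        scores[j] = mu
        zscores[j] = mu / se if se > 0 else 0.0
    return scores, zscores
```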

Conclusion. Ensemble methods help reduce the variance. They benefit low-bias base learners the most. One should not confuse feature-set variability with variance in the predictions. CV committees allow ranking features according to sensitivity and computing z-scores.

Exercise Class: ARCENE. [Figures: error curves for boosting and forward stagewise boosting on the ARCENE data.] Best performances: .9% ± .2 (using training set + validation set in training) with % of the features ( features). Baseline model: 4.7% ± .4, features. y_svc = svc({'coef=', 'degree=3', 'gamma=', 'shrinkage=.'}); y_model = chain({standardize, s2n('f_max='), normalize, y_svc}).

Forward stagewise boosting:
1. Initialize F(x) = 0.
2. For k = 1 to M: F(x) ← F(x) + α_k f_k(x); at step t, (α_k, f_k) = argmin_{α,f} Σ_i exp[−y_i (F_{t−1}(x_i) + α f(x_i))].
3. Output F(x) = Σ_{k=1:M} α_k f_k(x).
Exercise: compute α_k and f_k for decision stumps (a sketch is given below).
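
For the last exercise step (compute α_k, f_k for decision stumps), here is a compact AdaBoost-style sketch with decision stumps as base learners; the brute-force stump search and the parameter choices are for illustration only, not the reference solution of the class.

```python
import numpy as np

def fit_stump(X, y, w):
    """Best decision stump under example weights w: threshold one feature, predict +/-1."""
    best_err, best_params = np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w[pred != y])
                if err < best_err:
                    best_err, best_params = err, (j, thr, sign)
    return best_err, best_params

def stump_predict(params, X):
    j, thr, sign = params
    return sign * np.where(X[:, j] > thr, 1, -1)

def adaboost(X, y, n_rounds=20):
    """Forward stagewise fitting of the exponential loss with stumps:
    re-weight the data to concentrate on mistakes, fit a stump, and give it
    weight alpha = 1/2 log((1 - eps) / eps).  Labels y must be in {-1, +1}."""
    m = len(y)
    w = np.full(m, 1.0 / m)
    ensemble = []
    for _ in range(n_rounds):
        eps, params = fit_stump(X, y, w)
        eps = min(max(eps, 1e-12), 1 - 1e-12)      # numerical guard
        alpha = 0.5 * np.log((1 - eps) / eps)
        pred = stump_predict(params, X)
        w *= np.exp(-alpha * y * pred)             # exponential-loss re-weighting
        w /= w.sum()
        ensemble.append((alpha, params))
    return ensemble

def ensemble_predict(ensemble, X):
    F = sum(alpha * stump_predict(params, X) for alpha, params in ensemble)
    return np.sign(F)
```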

AdaBoost weight of each base learner: α = (1/2) log[(1 − E(f)) / E(f)], where E(f) is the weighted error rate of f.
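
For completeness, here is the standard short derivation of this weight (implied by the slides but not spelled out): with normalized example weights w_i ∝ exp(−y_i F_{k−1}(x_i)) and a fixed base learner f of weighted error ε = Σ_{i: y_i f(x_i) < 0} w_i,

```latex
J(\alpha) = \sum_i w_i\, e^{-\alpha y_i f(x_i)}
          = (1-\varepsilon)\, e^{-\alpha} + \varepsilon\, e^{\alpha},
\qquad
\frac{dJ}{d\alpha} = -(1-\varepsilon)\, e^{-\alpha} + \varepsilon\, e^{\alpha} = 0
\;\Longrightarrow\;
\alpha = \tfrac{1}{2}\log\frac{1-\varepsilon}{\varepsilon}.
```

This is exactly the α = (1/2) log[(1 − E(f))/E(f)] formula above, with E(f) playing the role of ε.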
