Ensemble Methods. NLP ML Web, Fall 2013. Andrew Rosenberg. TA/Grader: David Guy Brizan
How do you make a decision? What do you want for lunch today? What did you have last night? What are your favorite foods? Have you been to this restaurant before? How expensive is it? Are you on a diet? Moral/ethical/religious concerns.
Decision Making: collaboration; weighing disparate sources of evidence.
Ensemble Methods. Ensemble methods are based on the hypothesis that an aggregated decision from multiple experts can be superior to a decision from a single system.
Ensemble averaging. Assume your prediction has noise in it: $y = f(x) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Averaging $k$ predictions gives $\hat{y} = \frac{1}{k}\sum_{i=1}^{k}\left(f(x) + \epsilon^{(i)}\right)$. On multiple iid observations the epsilons cancel out, leaving a better estimate of the signal. http://terpconnect.umd.edu/~toh/spectrum/signalsandnoise.html
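A minimal numpy sketch of this effect (the signal, noise level, and ensemble size are made-up values): averaging k independently noisy copies of the same prediction shrinks the noise standard deviation by roughly a factor of $\sqrt{k}$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
signal = np.sin(2 * np.pi * x)          # the true f(x)
k = 25                                  # number of noisy "models"/observations
sigma = 0.5                             # noise standard deviation

# k iid noisy versions of the same prediction: f(x) + eps_i
noisy = signal + rng.normal(0.0, sigma, size=(k, x.size))
averaged = noisy.mean(axis=0)

print("single-model noise std:", (noisy[0] - signal).std())   # ~ sigma
print("averaged noise std    :", (averaged - signal).std())   # ~ sigma / sqrt(k)
```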
Combining disparate evidence. Early fusion: combine features. [Diagram: acoustic, video, and lexical features are concatenated and fed to a single classifier.]
Combining disparate evidence. Late fusion: combine predictions. [Diagram: acoustic, video, and lexical features each feed their own classifier; the predictions are merged.]
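A rough sketch of both fusion styles, assuming numpy and scikit-learn are available; the three synthetic feature blocks simply stand in for acoustic, video, and lexical features, and the dataset and classifier choices are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Three synthetic feature streams standing in for acoustic / video / lexical features.
X, y = make_classification(n_samples=600, n_features=30, n_informative=12, random_state=0)
streams = np.split(X, 3, axis=1)
splits = [train_test_split(S, y, test_size=0.3, random_state=0) for S in streams]

# Early fusion: concatenate the feature streams, train one classifier.
Xtr = np.hstack([s[0] for s in splits]); Xte = np.hstack([s[1] for s in splits])
ytr, yte = splits[0][2], splits[0][3]
early = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print("early fusion accuracy:", early.score(Xte, yte))

# Late fusion: one classifier per stream, merge by averaging predicted probabilities.
probs = []
for Str, Ste, ytr_s, _ in splits:
    clf = LogisticRegression(max_iter=1000).fit(Str, ytr_s)
    probs.append(clf.predict_proba(Ste)[:, 1])
late_pred = (np.mean(probs, axis=0) > 0.5).astype(int)
print("late fusion accuracy :", (late_pred == yte).mean())
```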
Classifier Fusion. Construct an answer from k predictions. [Diagram: a test instance is given to classifiers C1, C2, C3, C4.]
Classifier Fusion. Construct an answer from k predictions. [Diagram: features feed several classifiers whose predictions are merged.]
Majority Voting. Each classifier generates a prediction and confidence score. Choose the prediction that receives the most votes from the ensemble. [Diagram: features feed several classifiers; their votes are summed.]
Weighted Majority Voting. Most classifiers can be interpreted as delivering a distribution over predictions. Rather than sum the number of votes, generate an average distribution from the sum. This is the same as taking a vote where each prediction contributes its confidence. [Diagram: features feed several classifiers; their distributions are combined by a weighted sum.]
Sum, Max, Min. Majority voting can be viewed as summing the scores from each ensemble member. Other aggregation functions can be used, including the maximum and the minimum. What is the implication of these? [Diagram: features feed several classifiers; their scores go through an aggregator.]
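A small illustration of these aggregators over invented per-class score distributions from three ensemble members; note how the max aggregator can be swayed by a single very confident member, while the min aggregator demands that every member give the class some support.

```python
import numpy as np

# Invented per-class probability distributions from three ensemble members
# (rows: classifiers, columns: classes A, B, C).
scores = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.4, 0.45, 0.15],
])
classes = np.array(["A", "B", "C"])

# Unweighted majority vote: each classifier contributes its argmax.
votes = scores.argmax(axis=1)
majority = classes[np.bincount(votes, minlength=3).argmax()]

# Weighted majority vote / averaging: sum (or average) the distributions.
weighted = classes[scores.sum(axis=0).argmax()]

# Other aggregators: max and min over the ensemble per class.
agg_max = classes[scores.max(axis=0).argmax()]   # trusts the single most confident member
agg_min = classes[scores.min(axis=0).argmax()]   # requires every member to support the class

print(majority, weighted, agg_max, agg_min)
```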
Second-tier classifier. Classifier predictions are used as input features for a second classifier. How should the second-tier classifier be trained? [Diagram: features feed several classifiers; their predictions feed a second-tier classifier.]
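One common answer, sketched below under the assumption that scikit-learn is available: train the second tier on out-of-fold (cross-validated) predictions of the first tier, so the combiner never sees predictions the base classifiers could have memorized. The base classifiers and dataset here are arbitrary stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

base = [DecisionTreeClassifier(max_depth=3, random_state=0),
        GaussianNB(),
        LogisticRegression(max_iter=1000)]

# First tier: out-of-fold predicted probabilities become features for the second tier,
# so the combiner is trained on predictions the base models did not overfit to.
meta_train = np.column_stack([
    cross_val_predict(clf, Xtr, ytr, cv=5, method="predict_proba")[:, 1] for clf in base])
meta_test = np.column_stack([
    clf.fit(Xtr, ytr).predict_proba(Xte)[:, 1] for clf in base])

# Second tier: a simple classifier over the first-tier predictions.
combiner = LogisticRegression().fit(meta_train, ytr)
print("stacked accuracy:", combiner.score(meta_test, yte))
```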
Classifier Fusion. Advantages: experts can be trained separately on specialized data, and they can be trained more quickly due to smaller data sets and lower feature-space dimensionality. Disadvantages: interactions across feature sets may be missed, and explanations of how and why it works can be limited.
Bagging: Bootstrap Aggregating. Train k models on different samples of the training data. Predict by averaging the results of the k models. A simple instance of majority voting.
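A minimal sketch of bagging with decision trees, assuming numpy and scikit-learn are available; the ensemble size and dataset are made-up values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
k = 15
models = []
for _ in range(k):
    # Bootstrap sample: draw n points with replacement from the training data.
    idx = rng.integers(0, len(Xtr), size=len(Xtr))
    models.append(DecisionTreeClassifier().fit(Xtr[idx], ytr[idx]))

# Predict by majority vote (averaging the k 0/1 predictions and rounding).
votes = np.mean([m.predict(Xte) for m in models], axis=0)
print("bagged accuracy:", ((votes > 0.5).astype(int) == yte).mean())
```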
Model averaging. Seen in language modeling. Take, for example, a linear classifier $y = f(w^T x + b)$. Average the model parameters: $\bar{w} = \frac{1}{k}\sum_i w_i$ and $\bar{b} = \frac{1}{k}\sum_i b_i$.
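A toy numpy illustration of the parameter-averaging formulas above; the k = 3 weight vectors, biases, and test point are invented.

```python
import numpy as np

# Invented parameters from k = 3 linear classifiers y = f(w^T x + b).
ws = [np.array([0.9, -1.2]), np.array([1.1, -0.8]), np.array([1.0, -1.0])]
bs = [0.1, -0.05, 0.0]

# Averaged model: w_bar = (1/k) sum_i w_i, b_bar = (1/k) sum_i b_i.
w_bar = np.mean(ws, axis=0)
b_bar = np.mean(bs)

x = np.array([0.5, 2.0])
score = w_bar @ x + b_bar      # a single averaged model replaces the ensemble
print(w_bar, b_bar, score)
```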
Mixture of Experts. Can we do better than averaging over all training points? Look at the input data to see which points are better classified by which classifiers. Allow each expert to focus on those cases where it is already doing better than average.
Mixture of Experts. The array of $p_i$'s is called a gating network. Optimize the $p_i$ as part of the loss function. [Diagram: features feed several classifiers; each classifier's output is weighted by a gating value $p_i$ before combination.]
Probability correct under a mixture of experts: $p(d^c \mid \mathrm{MoE}) = \sum_i p_i^c \, \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}(d^c - o_i^c)^2}$, where $p_i^c$ is the mixing coefficient for expert $i$ on case $c$, $d^c$ is the desired output on case $c$, and the exponential is a Gaussian loss between the desired and observed outputs. From Hinton's lecture.
Gating network gradient: $\frac{\partial E^c}{\partial o_i^c} = -\frac{\partial \log p(d^c \mid \mathrm{MoE})}{\partial o_i^c} = -\frac{p_i^c \, e^{-\frac{1}{2}(d^c - o_i^c)^2}}{\sum_j p_j^c \, e^{-\frac{1}{2}(d^c - o_j^c)^2}}\,(d^c - o_i^c)$, where the fraction is the posterior probability of expert $i$. From Hinton's lecture.
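A small numerical check of the two formulas above, with invented values for the desired output, the expert outputs, and the mixing coefficients; it evaluates the mixture likelihood, each expert's posterior responsibility, and the resulting gradient of the negative log-likelihood with respect to each expert's output.

```python
import numpy as np

d = 1.0                                   # desired output on case c
o = np.array([0.4, 0.9, 1.6])             # outputs of the three experts
p = np.array([0.2, 0.5, 0.3])             # gating (mixing) coefficients, sum to 1

# Mixture likelihood: p(d | MoE) = sum_i p_i * N(d; o_i, 1)
comp = p * np.exp(-0.5 * (d - o) ** 2) / np.sqrt(2 * np.pi)
likelihood = comp.sum()

# Posterior responsibility of each expert, and the gradient of E = -log p(d | MoE)
# with respect to each expert's output: dE/do_i = -posterior_i * (d - o_i).
posterior = comp / likelihood
grad = -posterior * (d - o)

print("p(d|MoE)  :", likelihood)
print("posteriors:", posterior)
print("dE/do_i   :", grad)
```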
AdaBoost: Adaptive Boosting. Construct an ensemble of weak classifiers, typically single-split decision trees (stumps). Identify weights for each classifier.
Weak Classifiers. Weak classifiers have low performance (slightly better than chance) and high variance, and (for AdaBoost) should have uncorrelated errors.
Boosting Hypothesis The existence of a weak learner implies the existence of a strong learner.
AdaBoost Decision Function: $C(x) = \alpha_1 C_1(x) + \alpha_2 C_2(x) + \ldots + \alpha_k C_k(x)$. AdaBoost generates a prediction from a weighted sum of the predictions of each classifier. The AdaBoost algorithm determines the weights. The weight training is different from any loss function we've used. Similar to systems that use a second-tier classifier to learn a combination function.
AdaBoost training algorithm. Repeat: identify the best unused classifier $C_i$; assign it a weight based on its performance; update the weights of each data point based on whether or not it is classified correctly. Until performance converges or all classifiers are included.
Identify the best classifier. Generate hypotheses using each unused classifier. Calculate the weighted error using the current data point weights (data point weights are initialized to one): $W_e = \sum_{y_i \neq k_m(x_i)} w_i^{(m)}$, a weighted count of how many errors were made.
Generate a weight for the classifier. The error rate is the ratio of the weighted error to the total weight at this iteration, $e_m = \frac{W_e}{W_m}$, and the new classifier weight is $\alpha_m = \frac{1}{2}\ln\frac{1 - e_m}{e_m}$. The larger the reduction in error, the larger the classifier weight.
Data point weighting. If data point $i$ was not correctly classified: $w_i^{(m+1)} = w_i^{(m)} e^{\alpha_m} = w_i^{(m)} \sqrt{\frac{1 - e_m}{e_m}}$, a factor greater than 1. If data point $i$ was correctly classified: $w_i^{(m+1)} = w_i^{(m)} e^{-\alpha_m} = w_i^{(m)} \sqrt{\frac{e_m}{1 - e_m}}$, a factor less than 1.
AdaBoost training algorithm. Repeat: identify the best unused classifier $C_i$; assign it a weight based on its performance; update the weights of each data point based on whether or not it is classified correctly. Until performance converges or all classifiers are included.
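A from-scratch numpy sketch of this loop with single-split decision stumps as the weak classifiers. Details such as the exhaustive stump search, the numerical clipping of the error rate, and the toy dataset are my own filling-in; unlike the slide's wording, this version regenerates candidate stumps each round rather than picking from a fixed pool of unused classifiers.

```python
import numpy as np

def fit_stump(X, y, w):
    """Best single-split stump (feature, threshold, polarity) under weights w; y in {-1,+1}."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - thr) > 0, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def stump_predict(X, j, thr, polarity):
    return np.where(polarity * (X[:, j] - thr) > 0, 1, -1)

def adaboost(X, y, rounds=10):
    w = np.ones(len(y))                       # data point weights, initialized to one
    ensemble = []
    for _ in range(rounds):
        err, j, thr, pol = fit_stump(X, y, w)
        e_m = np.clip(err / w.sum(), 1e-10, 1 - 1e-10)   # weighted error rate
        alpha = 0.5 * np.log((1 - e_m) / e_m)            # classifier weight
        pred = stump_predict(X, j, thr, pol)
        # Up-weight misclassified points, down-weight correctly classified ones.
        w *= np.exp(-alpha * y * pred)
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(X, j, t, p) for a, j, t, p in ensemble)
    return np.sign(score)

# Tiny invented 1-D example: positives on the outside, negatives in the middle.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1, 1])
model = adaboost(X, y, rounds=5)
print("training accuracy:", (predict(model, X) == y).mean())
```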
Random Forests. Random Forests are similar to boosted decision trees, but without the adaptive training. An ensemble of classifiers is trained, each on a different subset of features and a different set of data points (a random subspace projection).
Decision Tree. [Diagram: a tree over the world state. The root asks "is it raining?"; the "no" branch asks "is the sprinkler on?"; the leaves give P(wet) values of 0.95, 0.9, and 0.1.]
Construct a Forest of Trees. [Diagram: trees $t_1$ through $t_T$ each vote for a category $c$.]
Training Algorithm. Divide the training data into K subsets of data points and M variables, and train a unique decision tree on each subset. Benefits: improved generalization, reduced memory requirements, and simple multi-threading.
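A minimal sketch of this procedure, assuming numpy and scikit-learn are available: each tree sees a bootstrap sample of the data points and a random subset of the features, and the trees vote. (scikit-learn's RandomForestClassifier implements a stronger variant that also re-samples features at every split.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees, n_feats = 25, 6                  # forest size and features per tree (made-up values)
forest = []
for _ in range(n_trees):
    rows = rng.integers(0, len(Xtr), size=len(Xtr))                # random data points
    cols = rng.choice(Xtr.shape[1], size=n_feats, replace=False)   # random feature subset
    tree = DecisionTreeClassifier().fit(Xtr[rows][:, cols], ytr[rows])
    forest.append((tree, cols))

# Each tree votes on its own feature subspace; combine by majority vote.
votes = np.mean([t.predict(Xte[:, cols]) for t, cols in forest], axis=0)
print("forest accuracy:", ((votes > 0.5).astype(int) == yte).mean())
```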
Handling Class Imbalances. Class imbalance, or a skewed class distribution, happens when the labels are not equally represented. Class imbalance presents a number of challenges. Density estimation: low priors can lead to poor estimation of minority classes. Loss functions: since the loss of each point is equal, getting a lot of majority-class points correct dominates. Evaluation: accuracy is less informative.
Impact on Accuracy. Example from information retrieval: find the 10 relevant documents in a set of 100. A system that hypothesizes "negative" for every document still achieves Accuracy = 90%.
                 True: Positive   True: Negative
Hyp: Positive          0                 0
Hyp: Negative         10                90
Contingency Table.
                 True: Positive   True: Negative
Hyp: Positive    True Positive    False Positive
Hyp: Negative    False Negative   True Negative
Accuracy = (TP + TN) / (TP + FP + TN + FN)
F-Measure. Precision: how many hypothesized events were true events, $P = \frac{TP}{TP + FP}$. Recall: how many of the true events were identified, $R = \frac{TP}{TP + FN}$. F-Measure: the harmonic mean of precision and recall, $F = \frac{2PR}{P + R}$. (Running example: TP = 0, FP = 0, FN = 10, TN = 90.)
F-Measure. The F-measure can be weighted to favor precision or recall: $\beta > 1$ favors recall, $\beta < 1$ favors precision. $F_\beta = \frac{(1 + \beta^2)PR}{\beta^2 P + R}$
F-Measure.
                 True: Positive   True: Negative
Hyp: Positive          0                 0
Hyp: Negative         10                90
$P = 0$, $R = 0$, $F_1 = 0$
F-Measure.
                 True: Positive   True: Negative
Hyp: Positive         10                50
Hyp: Negative          0                40
$P = \frac{10}{60}$, $R = 1$, $F_1 = .29$
F-Measure.
                 True: Positive   True: Negative
Hyp: Positive          9                 1
Hyp: Negative          1                89
$P = .9$, $R = .9$, $F_1 = .9$
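A small helper that reproduces the three worked examples above from their contingency-table counts (it returns 0 where precision or recall is undefined, matching the first slide).

```python
def prf(tp, fp, fn, tn, beta=1.0):
    """Precision, recall, and F_beta from contingency-table counts (0 when undefined)."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if (p + r) else 0.0
    return p, r, f

# The three slides above, as (TP, FP, FN, TN):
for counts in [(0, 0, 10, 90), (10, 50, 0, 40), (9, 1, 1, 89)]:
    p, r, f = prf(*counts)
    print(counts, "P=%.2f R=%.2f F1=%.2f" % (p, r, f))
```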
ROC and AUC. It is common to plot classifier performance at a variety of settings or thresholds. Receiver Operating Characteristic (ROC) curves plot the true positive rate against the false positive rate as the decision threshold is swept. The overall performance is summarized by the Area Under the Curve (AUC).
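A minimal sketch assuming scikit-learn is available: score a held-out set with a classifier, sweep the decision threshold to get the ROC points, and summarize with the AUC. The imbalanced synthetic dataset is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic problem: roughly 90% negative class.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(Xtr, ytr).predict_proba(Xte)[:, 1]
fpr, tpr, thresholds = roc_curve(yte, scores)   # one (FPR, TPR) point per threshold
print("first ROC points:", list(zip(fpr.round(2), tpr.round(2)))[:5])
print("AUC:", roc_auc_score(yte, scores))
```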
Skew in Classifier Training. Most classifiers train better with balanced training data. Bayesian methods: they rely on a prior to weight the classes, and estimation of the class-conditional density is impacted by skew in the number of samples. Loss functions: there is more pressure to set the decision boundary in favor of the majority classes.
Skew in Classifier Training
Skew in Classifier Training. [Figure annotations: twice as many errors; same distance from the optimal decision boundary.]
Sampling. Artificially manipulating the number of training samples can help reduce the impact of class imbalance. Undersampling: randomly select $N_m$ data points (the size of the minority class) from the majority class for training.
Sampling. Oversampling: reproduce the minority class points until the class sizes are balanced.
Ensemble Sampling. Repeat undersampling $N_M / N_m$ times with different samples of the majority-class data points. Train $N_M / N_m$ classifiers and combine them with majority voting. [Diagram: classifiers C1, C2, C3 are merged.]
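A minimal sketch of ensemble undersampling, assuming numpy and scikit-learn are available: each member is trained on all minority-class points plus an equal-sized fresh sample of majority-class points, and the members' probabilities are averaged (a soft majority vote). The dataset and classifier are arbitrary stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# An imbalanced problem: roughly 90% majority class.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
minority = np.where(ytr == 1)[0]
majority = np.where(ytr == 0)[0]
n_members = len(majority) // len(minority)      # roughly N_M / N_m classifiers

probs = []
for _ in range(n_members):
    # Undersample: all minority points plus an equal-sized random slice of the majority.
    idx = np.concatenate([minority, rng.choice(majority, size=len(minority), replace=False)])
    clf = LogisticRegression(max_iter=1000).fit(Xtr[idx], ytr[idx])
    probs.append(clf.predict_proba(Xte)[:, 1])

# Merge with (soft) majority voting over the ensemble.
pred = (np.mean(probs, axis=0) > 0.5).astype(int)
print("minority-class recall:", (pred[yte == 1] == 1).mean())
```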
Ensemble Methods. A very simple and effective technique for improving classification performance (Netflix Prize, Watson, etc.). Mathematical justification. Intuitive appeal: mirrors how decisions are made by people and organizations. Can allow for modular training.