Ensemble Methods. NLP ML Web, Fall 2013. Andrew Rosenberg. TA/Grader: David Guy Brizan
How do you make a decision? What do you want for lunch today? What did you have last night? What are your favorite foods? Have you been to this restaurant before? How expensive is it? Are you on a diet? Moral/ethical/religious concerns.
Decision Making: collaboration; weighing disparate sources of evidence.
Ensemble Methods. Ensemble methods are based on the hypothesis that an aggregated decision from multiple experts can be superior to a decision from a single system.
Ensemble averaging. Assume your prediction has noise in it: $y = f(x) + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2)$. Averaging $k$ predictions gives $\hat{y} = \frac{1}{k}\sum_{i=1}^{k}\left(f(x) + \epsilon^{(i)}\right)$. On multiple iid observations the epsilons cancel out, leaving a better estimate of the signal. http://terpconnect.umd.edu/~toh/spectrum/signalsandnoise.html
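A minimal numpy sketch of this effect (the signal, noise level, and ensemble size are made-up values): averaging k independently noisy copies of the same prediction shrinks the noise standard deviation by roughly a factor of $\sqrt{k}$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
signal = np.sin(2 * np.pi * x)          # the true f(x)
k = 25                                  # number of noisy "models"/observations
sigma = 0.5                             # noise standard deviation

# k iid noisy versions of the same prediction: f(x) + eps_i
noisy = signal + rng.normal(0.0, sigma, size=(k, x.size))
averaged = noisy.mean(axis=0)

print("single-model noise std:", (noisy[0] - signal).std())   # ~ sigma
print("averaged noise std    :", (averaged - signal).std())   # ~ sigma / sqrt(k)
```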
Combining disparate evidence. Early fusion: combine features. [Diagram: acoustic, video, and lexical features are concatenated and fed to a single classifier.]
Combining disparate evidence. Late fusion: combine predictions. [Diagram: acoustic, video, and lexical features each feed their own classifier; the predictions are merged.]
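A rough sketch of both fusion styles, assuming numpy and scikit-learn are available; the three synthetic feature blocks simply stand in for acoustic, video, and lexical features, and the dataset and classifier choices are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Three synthetic feature streams standing in for acoustic / video / lexical features.
X, y = make_classification(n_samples=600, n_features=30, n_informative=12, random_state=0)
streams = np.split(X, 3, axis=1)
splits = [train_test_split(S, y, test_size=0.3, random_state=0) for S in streams]

# Early fusion: concatenate the feature streams, train one classifier.
Xtr = np.hstack([s[0] for s in splits]); Xte = np.hstack([s[1] for s in splits])
ytr, yte = splits[0][2], splits[0][3]
early = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
print("early fusion accuracy:", early.score(Xte, yte))

# Late fusion: one classifier per stream, merge by averaging predicted probabilities.
probs = []
for Str, Ste, ytr_s, _ in splits:
    clf = LogisticRegression(max_iter=1000).fit(Str, ytr_s)
    probs.append(clf.predict_proba(Ste)[:, 1])
late_pred = (np.mean(probs, axis=0) > 0.5).astype(int)
print("late fusion accuracy :", (late_pred == yte).mean())
```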
Classifier Fusion. Construct an answer from k predictions. [Diagram: a test instance is given to classifiers C1, C2, C3, C4.]
Classifier Fusion. Construct an answer from k predictions. [Diagram: features feed several classifiers whose predictions are merged.]
Majority Voting. Each classifier generates a prediction and confidence score. Choose the prediction that receives the most votes from the ensemble. [Diagram: features feed several classifiers; their votes are summed.]
Weighted Majority Voting. Most classifiers can be interpreted as delivering a distribution over predictions. Rather than sum the number of votes, generate an average distribution from the sum. This is the same as taking a vote where each prediction contributes its confidence. [Diagram: features feed several classifiers; their distributions are combined by a weighted sum.]
Sum, Max, Min. Majority voting can be viewed as summing the scores from each ensemble member. Other aggregation functions can be used, including the maximum and the minimum. What is the implication of these? [Diagram: features feed several classifiers; their scores go through an aggregator.]
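A small illustration of these aggregators over invented per-class score distributions from three ensemble members; note how the max aggregator can be swayed by a single very confident member, while the min aggregator demands that every member give the class some support.

```python
import numpy as np

# Invented per-class probability distributions from three ensemble members
# (rows: classifiers, columns: classes A, B, C).
scores = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.4, 0.45, 0.15],
])
classes = np.array(["A", "B", "C"])

# Unweighted majority vote: each classifier contributes its argmax.
votes = scores.argmax(axis=1)
majority = classes[np.bincount(votes, minlength=3).argmax()]

# Weighted majority vote / averaging: sum (or average) the distributions.
weighted = classes[scores.sum(axis=0).argmax()]

# Other aggregators: max and min over the ensemble per class.
agg_max = classes[scores.max(axis=0).argmax()]   # trusts the single most confident member
agg_min = classes[scores.min(axis=0).argmax()]   # requires every member to support the class

print(majority, weighted, agg_max, agg_min)
```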
Second-tier classifier. Classifier predictions are used as input features for a second classifier. How should the second-tier classifier be trained? [Diagram: features feed several classifiers; their predictions feed a second-tier classifier.]
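One common answer, sketched below under the assumption that scikit-learn is available: train the second tier on out-of-fold (cross-validated) predictions of the first tier, so the combiner never sees predictions the base classifiers could have memorized. The base classifiers and dataset here are arbitrary stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=800, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

base = [DecisionTreeClassifier(max_depth=3, random_state=0),
        GaussianNB(),
        LogisticRegression(max_iter=1000)]

# First tier: out-of-fold predicted probabilities become features for the second tier,
# so the combiner is trained on predictions the base models did not overfit to.
meta_train = np.column_stack([
    cross_val_predict(clf, Xtr, ytr, cv=5, method="predict_proba")[:, 1] for clf in base])
meta_test = np.column_stack([
    clf.fit(Xtr, ytr).predict_proba(Xte)[:, 1] for clf in base])

# Second tier: a simple classifier over the first-tier predictions.
combiner = LogisticRegression().fit(meta_train, ytr)
print("stacked accuracy:", combiner.score(meta_test, yte))
```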
Classifier Fusion. Advantages: experts can be trained separately on specialized data, and they can be trained more quickly due to smaller data sets and lower feature-space dimensionality. Disadvantages: interactions across feature sets may be missed, and explanations of how and why it works can be limited.
Bagging: Bootstrap Aggregating. Train k models on different samples of the training data. Predict by averaging the results of the k models. A simple instance of majority voting.
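A minimal sketch of bagging with decision trees, assuming numpy and scikit-learn are available; the ensemble size and dataset are made-up values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
k = 15
models = []
for _ in range(k):
    # Bootstrap sample: draw n points with replacement from the training data.
    idx = rng.integers(0, len(Xtr), size=len(Xtr))
    models.append(DecisionTreeClassifier().fit(Xtr[idx], ytr[idx]))

# Predict by majority vote (averaging the k 0/1 predictions and rounding).
votes = np.mean([m.predict(Xte) for m in models], axis=0)
print("bagged accuracy:", ((votes > 0.5).astype(int) == yte).mean())
```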
Model averaging. Seen in language modeling. Take, for example, a linear classifier $y = f(w^T x + b)$. Average the model parameters: $\bar{w} = \frac{1}{k}\sum_i w_i$ and $\bar{b} = \frac{1}{k}\sum_i b_i$.
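A toy numpy illustration of the parameter-averaging formulas above; the k = 3 weight vectors, biases, and test point are invented.

```python
import numpy as np

# Invented parameters from k = 3 linear classifiers y = f(w^T x + b).
ws = [np.array([0.9, -1.2]), np.array([1.1, -0.8]), np.array([1.0, -1.0])]
bs = [0.1, -0.05, 0.0]

# Averaged model: w_bar = (1/k) sum_i w_i, b_bar = (1/k) sum_i b_i.
w_bar = np.mean(ws, axis=0)
b_bar = np.mean(bs)

x = np.array([0.5, 2.0])
score = w_bar @ x + b_bar      # a single averaged model replaces the ensemble
print(w_bar, b_bar, score)
```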
Mixture of Experts. Can we do better than averaging over all training points? Look at the input data to see which points are better classified by which classifiers. Allow each expert to focus on those cases where it is already doing better than average.
Mixture of Experts. The array of $p_i$'s is called a gating network. Optimize the $p_i$ as part of the loss function. [Diagram: features feed several classifiers; each classifier's output is weighted by a gating value $p_i$ before combination.]
Probability correct under a mixture of experts: $p(d^c \mid \mathrm{MoE}) = \sum_i p_i^c \, \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}(d^c - o_i^c)^2}$, where $p_i^c$ is the mixing coefficient for expert $i$ on case $c$, $d^c$ is the desired output on case $c$, and the exponential is a Gaussian loss between the desired and observed outputs. From Hinton's lecture.
Gating network gradient: $\frac{\partial E^c}{\partial o_i^c} = -\frac{\partial \log p(d^c \mid \mathrm{MoE})}{\partial o_i^c} = -\frac{p_i^c \, e^{-\frac{1}{2}(d^c - o_i^c)^2}}{\sum_j p_j^c \, e^{-\frac{1}{2}(d^c - o_j^c)^2}}\,(d^c - o_i^c)$, where the fraction is the posterior probability of expert $i$. From Hinton's lecture.
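A small numerical check of the two formulas above, with invented values for the desired output, the expert outputs, and the mixing coefficients; it evaluates the mixture likelihood, each expert's posterior responsibility, and the resulting gradient of the negative log-likelihood with respect to each expert's output.

```python
import numpy as np

d = 1.0                                   # desired output on case c
o = np.array([0.4, 0.9, 1.6])             # outputs of the three experts
p = np.array([0.2, 0.5, 0.3])             # gating (mixing) coefficients, sum to 1

# Mixture likelihood: p(d | MoE) = sum_i p_i * N(d; o_i, 1)
comp = p * np.exp(-0.5 * (d - o) ** 2) / np.sqrt(2 * np.pi)
likelihood = comp.sum()

# Posterior responsibility of each expert, and the gradient of E = -log p(d | MoE)
# with respect to each expert's output: dE/do_i = -posterior_i * (d - o_i).
posterior = comp / likelihood
grad = -posterior * (d - o)

print("p(d|MoE)  :", likelihood)
print("posteriors:", posterior)
print("dE/do_i   :", grad)
```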
AdaBoost: Adaptive Boosting. Construct an ensemble of weak classifiers, typically single-split decision trees (stumps). Identify weights for each classifier.
Weak Classifiers. Weak classifiers have low performance (slightly better than chance) and high variance, and (for AdaBoost) should have uncorrelated errors.
Boosting Hypothesis The existence of a weak learner implies the existence of a strong learner.
AdaBoost Decision Function: $C(x) = \alpha_1 C_1(x) + \alpha_2 C_2(x) + \ldots + \alpha_k C_k(x)$. AdaBoost generates a prediction from a weighted sum of the predictions of each classifier. The AdaBoost algorithm determines the weights. The weight training is different from any loss function we've used. Similar to systems that use a second-tier classifier to learn a combination function.
AdaBoost training algorithm. Repeat: identify the best unused classifier $C_i$; assign it a weight based on its performance; update the weights of each data point based on whether or not it is classified correctly. Until performance converges or all classifiers are included.
Identify the best classifier. Generate hypotheses using each unused classifier. Calculate the weighted error using the current data point weights (data point weights are initialized to one): $W_e = \sum_{y_i \neq k_m(x_i)} w_i^{(m)}$, a weighted count of how many errors were made.
Generate a weight for the classifier. The error rate is the ratio of the weighted error to the total weight at this iteration, $e_m = \frac{W_e}{W_m}$, and the new classifier weight is $\alpha_m = \frac{1}{2}\ln\frac{1 - e_m}{e_m}$. The larger the reduction in error, the larger the classifier weight.
Data point weighting. If data point $i$ was not correctly classified: $w_i^{(m+1)} = w_i^{(m)} e^{\alpha_m} = w_i^{(m)} \sqrt{\frac{1 - e_m}{e_m}}$, a factor greater than 1. If data point $i$ was correctly classified: $w_i^{(m+1)} = w_i^{(m)} e^{-\alpha_m} = w_i^{(m)} \sqrt{\frac{e_m}{1 - e_m}}$, a factor less than 1.
AdaBoost training algorithm. Repeat: identify the best unused classifier $C_i$; assign it a weight based on its performance; update the weights of each data point based on whether or not it is classified correctly. Until performance converges or all classifiers are included.
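A from-scratch numpy sketch of this loop with single-split decision stumps as the weak classifiers. Details such as the exhaustive stump search, the numerical clipping of the error rate, and the toy dataset are my own filling-in; unlike the slide's wording, this version regenerates candidate stumps each round rather than picking from a fixed pool of unused classifiers.

```python
import numpy as np

def fit_stump(X, y, w):
    """Best single-split stump (feature, threshold, polarity) under weights w; y in {-1,+1}."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = np.where(polarity * (X[:, j] - thr) > 0, 1, -1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def stump_predict(X, j, thr, polarity):
    return np.where(polarity * (X[:, j] - thr) > 0, 1, -1)

def adaboost(X, y, rounds=10):
    w = np.ones(len(y))                       # data point weights, initialized to one
    ensemble = []
    for _ in range(rounds):
        err, j, thr, pol = fit_stump(X, y, w)
        e_m = np.clip(err / w.sum(), 1e-10, 1 - 1e-10)   # weighted error rate
        alpha = 0.5 * np.log((1 - e_m) / e_m)            # classifier weight
        pred = stump_predict(X, j, thr, pol)
        # Up-weight misclassified points, down-weight correctly classified ones.
        w *= np.exp(-alpha * y * pred)
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    score = sum(a * stump_predict(X, j, t, p) for a, j, t, p in ensemble)
    return np.sign(score)

# Tiny invented 1-D example: positives on the outside, negatives in the middle.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([1, 1, -1, -1, 1, 1])
model = adaboost(X, y, rounds=5)
print("training accuracy:", (predict(model, X) == y).mean())
```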
Random Forests. Random Forests are similar to boosted decision trees, but without the adaptive training. An ensemble of classifiers is trained, each on a different subset of features and a different set of data points (a random subspace projection).
Decision Tree. [Diagram: a tree over the world state. The root asks "is it raining?"; the "no" branch asks "is the sprinkler on?"; the leaves give P(wet) values of 0.95, 0.9, and 0.1.]
Construct a Forest of Trees. [Diagram: trees $t_1$ through $t_T$ each vote for a category $c$.]
Training Algorithm. Divide the training data into K subsets of data points and M variables, and train a unique decision tree on each subset. Benefits: improved generalization, reduced memory requirements, and simple multi-threading.
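A minimal sketch of this procedure, assuming numpy and scikit-learn are available: each tree sees a bootstrap sample of the data points and a random subset of the features, and the trees vote. (scikit-learn's RandomForestClassifier implements a stronger variant that also re-samples features at every split.)

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=20, n_informative=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n_trees, n_feats = 25, 6                  # forest size and features per tree (made-up values)
forest = []
for _ in range(n_trees):
    rows = rng.integers(0, len(Xtr), size=len(Xtr))                # random data points
    cols = rng.choice(Xtr.shape[1], size=n_feats, replace=False)   # random feature subset
    tree = DecisionTreeClassifier().fit(Xtr[rows][:, cols], ytr[rows])
    forest.append((tree, cols))

# Each tree votes on its own feature subspace; combine by majority vote.
votes = np.mean([t.predict(Xte[:, cols]) for t, cols in forest], axis=0)
print("forest accuracy:", ((votes > 0.5).astype(int) == yte).mean())
```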
Handling Class Imbalances. Class imbalance, or a skewed class distribution, happens when the labels are not equally represented. Class imbalance presents a number of challenges. Density estimation: low priors can lead to poor estimation of minority classes. Loss functions: since the loss of each point is equal, getting a lot of majority-class points correct dominates. Evaluation: accuracy is less informative.
Impact on Accuracy. Example from information retrieval: find the 10 relevant documents in a set of 100. A system that hypothesizes "negative" for every document still achieves Accuracy = 90%.
                 True: Positive   True: Negative
Hyp: Positive          0                 0
Hyp: Negative         10                90
Contingency Table.
                 True: Positive   True: Negative
Hyp: Positive    True Positive    False Positive
Hyp: Negative    False Negative   True Negative
Accuracy = (TP + TN) / (TP + FP + TN + FN)
F-Measure. Precision: how many hypothesized events were true events, $P = \frac{TP}{TP + FP}$. Recall: how many of the true events were identified, $R = \frac{TP}{TP + FN}$. F-Measure: the harmonic mean of precision and recall, $F = \frac{2PR}{P + R}$. (Running example: TP = 0, FP = 0, FN = 10, TN = 90.)
F-Measure. The F-measure can be weighted to favor precision or recall: $\beta > 1$ favors recall, $\beta < 1$ favors precision. $F_\beta = \frac{(1 + \beta^2)PR}{\beta^2 P + R}$
F-Measure.
                 True: Positive   True: Negative
Hyp: Positive          0                 0
Hyp: Negative         10                90
$P = 0$, $R = 0$, $F_1 = 0$
F-Measure.
                 True: Positive   True: Negative
Hyp: Positive         10                50
Hyp: Negative          0                40
$P = \frac{10}{60}$, $R = 1$, $F_1 = .29$
F-Measure.
                 True: Positive   True: Negative
Hyp: Positive          9                 1
Hyp: Negative          1                89
$P = .9$, $R = .9$, $F_1 = .9$
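A small helper that reproduces the three worked examples above from their contingency-table counts (it returns 0 where precision or recall is undefined, matching the first slide).

```python
def prf(tp, fp, fn, tn, beta=1.0):
    """Precision, recall, and F_beta from contingency-table counts (0 when undefined)."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r) if (p + r) else 0.0
    return p, r, f

# The three slides above, as (TP, FP, FN, TN):
for counts in [(0, 0, 10, 90), (10, 50, 0, 40), (9, 1, 1, 89)]:
    p, r, f = prf(*counts)
    print(counts, "P=%.2f R=%.2f F1=%.2f" % (p, r, f))
```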
ROC and AUC. It is common to plot classifier performance at a variety of settings or thresholds. Receiver Operating Characteristic (ROC) curves plot the true positive rate against the false positive rate as the decision threshold is swept. The overall performance is summarized by the Area Under the Curve (AUC).
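A minimal sketch assuming scikit-learn is available: score a held-out set with a classifier, sweep the decision threshold to get the ROC points, and summarize with the AUC. The imbalanced synthetic dataset is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Imbalanced synthetic problem: roughly 90% negative class.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(Xtr, ytr).predict_proba(Xte)[:, 1]
fpr, tpr, thresholds = roc_curve(yte, scores)   # one (FPR, TPR) point per threshold
print("first ROC points:", list(zip(fpr.round(2), tpr.round(2)))[:5])
print("AUC:", roc_auc_score(yte, scores))
```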
Skew in Classifier Training. Most classifiers train better with balanced training data. Bayesian methods: they rely on a prior to weight the classes, and estimation of the class-conditional density is impacted by skew in the number of samples. Loss functions: there is more pressure to set the decision boundary in favor of the majority classes.
Skew in Classifier Training
Skew in Classifier Training. [Figure annotations: twice as many errors; same distance from the optimal decision boundary.]
Sampling. Artificially manipulating the number of training samples can help reduce the impact of class imbalance. Undersampling: randomly select $N_m$ data points (the size of the minority class) from the majority class for training.
Sampling. Oversampling: reproduce the minority class points until the class sizes are balanced.
Ensemble Sampling. Repeat undersampling $N_M / N_m$ times with different samples of the majority-class data points. Train $N_M / N_m$ classifiers and combine them with majority voting. [Diagram: classifiers C1, C2, C3 are merged.]
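A minimal sketch of ensemble undersampling, assuming numpy and scikit-learn are available: each member is trained on all minority-class points plus an equal-sized fresh sample of majority-class points, and the members' probabilities are averaged (a soft majority vote). The dataset and classifier are arbitrary stand-ins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# An imbalanced problem: roughly 90% majority class.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
minority = np.where(ytr == 1)[0]
majority = np.where(ytr == 0)[0]
n_members = len(majority) // len(minority)      # roughly N_M / N_m classifiers

probs = []
for _ in range(n_members):
    # Undersample: all minority points plus an equal-sized random slice of the majority.
    idx = np.concatenate([minority, rng.choice(majority, size=len(minority), replace=False)])
    clf = LogisticRegression(max_iter=1000).fit(Xtr[idx], ytr[idx])
    probs.append(clf.predict_proba(Xte)[:, 1])

# Merge with (soft) majority voting over the ensemble.
pred = (np.mean(probs, axis=0) > 0.5).astype(int)
print("minority-class recall:", (pred[yte == 1] == 1).mean())
```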
Ensemble Methods. A very simple and effective technique for improving classification performance (Netflix Prize, Watson, etc.). Mathematical justification. Intuitive appeal: mirrors how decisions are made by people and organizations. Can allow for modular training.