Ensemble Methods. NLP ML Web. Fall 2013. Andrew Rosenberg. TA/Grader: David Guy Brizan



How do you make a decision? What do you want for lunch today? What did you have last night? What are your favorite foods? Have you been to this restaurant before? How expensive is it? Are you on a diet? Moral/ethical/religious concerns.

Decision Making: Collaboration. Weighing disparate sources of evidence.

Ensemble Methods Ensemble Methods are based around the hypothesis that an aggregated decision from multiple experts can be superior to a decision from a single system.

Ensemble averaging. Assume your prediction has noise in it: y = f(x) + ε, with ε ~ N(0, σ²). Averaging k iid observations, ŷ = (1/k) Σᵢ [f(x) + ε⁽ⁱ⁾], the epsilons cancel out, leaving a better estimate of the signal. http://terpconnect.umd.edu/~toh/spectrum/signalsandnoise.html
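A quick numerical sketch of this cancellation (pure-Python illustration; the signal value, noise level, and function names are all hypothetical):

```python
import random

random.seed(0)
TRUE_SIGNAL = 3.0   # the underlying f(x)
SIGMA = 1.0         # noise standard deviation

def noisy_prediction():
    """One observation y = f(x) + eps, eps ~ N(0, sigma^2)."""
    return TRUE_SIGNAL + random.gauss(0.0, SIGMA)

def averaged_prediction(k):
    """Average of k iid observations; the noise terms largely cancel."""
    return sum(noisy_prediction() for _ in range(k)) / k

def rmse(k, trials=200):
    """Empirical root-mean-square error of the k-observation average."""
    sq = [(averaged_prediction(k) - TRUE_SIGNAL) ** 2 for _ in range(trials)]
    return (sum(sq) / trials) ** 0.5

print(rmse(1), rmse(100))  # the k=100 error is roughly 10x smaller
```

With iid zero-mean noise, the RMSE of a k-way average shrinks like σ/√k, which is the whole argument for ensemble averaging.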

Combining disparate evidence. Early fusion: combine features. [Diagram: acoustic, video, and lexical features are concatenated and fed to a single classifier.]

Combining disparate evidence. Late fusion: combine predictions. [Diagram: acoustic, video, and lexical features each feed their own classifier; the predictions are merged.]

Classifier Fusion. Construct an answer from k predictions. [Diagram: a test instance is scored by classifiers C1, C2, C3, C4.]

Classifier Fusion. Construct an answer from k predictions. [Diagram: the features feed several classifiers whose outputs are merged.]

Majority Voting. Each classifier generates a prediction and confidence score. Choose the prediction that receives the most votes among the predictions from the ensemble. [Diagram: the features feed several classifiers whose outputs are combined by a Sum node.]
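A minimal sketch of the vote itself (labels and member predictions are made up):

```python
from collections import Counter

def majority_vote(predictions):
    """Choose the label that receives the most votes from the ensemble."""
    return Counter(predictions).most_common(1)[0][0]

# three ensemble members predict a label for one test instance
print(majority_vote(["spam", "ham", "spam"]))  # -> spam
```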

Weighted Majority Voting. Most classifiers can be interpreted as delivering a distribution over predictions. Rather than sum the number of votes, generate an average distribution from the sum. This is the same as taking a vote where each prediction contributes its confidence. [Diagram: the features feed several classifiers whose outputs are combined by a Weighted Sum node.]
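A sketch of the same idea with confidence distributions (the label names and probabilities are invented): summing each member's distribution lets one very confident member outvote two lukewarm ones, which a raw count of votes would not.

```python
def weighted_vote(distributions):
    """Pick the label with the highest summed (equivalently, averaged)
    probability across all ensemble members' predicted distributions."""
    labels = distributions[0].keys()
    return max(labels, key=lambda lab: sum(d[lab] for d in distributions))

dists = [{"spam": 0.99, "ham": 0.01},   # one very confident member
         {"spam": 0.45, "ham": 0.55},   # two lukewarm members
         {"spam": 0.45, "ham": 0.55}]
# unweighted majority voting would say "ham" (two votes to one);
# summing confidences gives spam 1.89 vs. ham 1.11
print(weighted_vote(dists))  # -> spam
```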

Sum, Max, Min. Majority voting can be viewed as summing the scores from each ensemble member. Other aggregation functions can be used, including maximum and minimum. What is the implication of these? [Diagram: the features feed several classifiers whose outputs are combined by an Aggregator node.]

Second-tier classifier. Classifier predictions are used as input features for a second classifier. How should the second-tier classifier be trained? [Diagram: the features feed several classifiers; their predictions feed a second-tier classifier.]

Classifier Fusion. Advantages: experts can be trained separately on specialized data, and can be trained more quickly due to smaller data sets and lower feature-space dimensionality. Disadvantages: interactions across feature sets may be missed, and explanation of how and why it works can be limited.

Bagging: Bootstrap Aggregating. Train k models on different samples of the training data. Predict by averaging the results of the k models. A simple instance of majority voting.
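A toy sketch of bagging, assuming one-dimensional data and decision stumps as the base model (all names and the data are invented):

```python
import random
from collections import Counter

def train_stump(data):
    """Fit a 1-D decision stump: pick the threshold and label
    assignment that minimise training error on this sample."""
    best = None
    for t in sorted({x for x, _ in data}):
        for lo, hi in (("neg", "pos"), ("pos", "neg")):
            err = sum((lo if x <= t else hi) != y for x, y in data)
            if best is None or err < best[0]:
                best = (err, t, lo, hi)
    _, t, lo, hi = best
    return lambda x: lo if x <= t else hi

def bootstrap_sample(data):
    """Resample len(data) points with replacement."""
    return [random.choice(data) for _ in data]

def bagged_ensemble(data, k=15):
    """Bagging: k stumps, each trained on its own bootstrap sample."""
    return [train_stump(bootstrap_sample(data)) for _ in range(k)]

def bagged_predict(models, x):
    """Aggregate the k models' predictions by simple majority vote."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

random.seed(0)
train_data = [(0.1, "neg"), (0.2, "neg"), (0.3, "neg"),
              (0.7, "pos"), (0.8, "pos"), (0.9, "pos")]
ensemble = bagged_ensemble(train_data)
print(bagged_predict(ensemble, 0.05), bagged_predict(ensemble, 0.95))
```

A real bagged ensemble would train full classifiers on each bootstrap sample; stumps keep the sketch short.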

Model averaging. Seen in language modeling. Take, for example, a linear classifier y = f(wᵀx + b). Average the model parameters: W = (1/k) Σᵢ Wᵢ, b = (1/k) Σᵢ bᵢ.
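A small sketch of parameter averaging for linear models (the weight values below are invented):

```python
def average_models(models):
    """Average linear-model parameters across k models.
    Each model is a (weight_vector, bias) pair."""
    k = len(models)
    dim = len(models[0][0])
    w_avg = [sum(w[j] for w, _ in models) / k for j in range(dim)]
    b_avg = sum(b for _, b in models) / k
    return w_avg, b_avg

# three hypothetical linear classifiers y = f(w.x + b)
models = [([1.0, 2.0], 0.5), ([3.0, 0.0], -0.5), ([2.0, 1.0], 0.0)]
print(average_models(models))  # -> ([2.0, 1.0], 0.0)
```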

Mixture of Experts. Can we do better than averaging over all training points? Look at the input data to see which points are better classified by which classifiers. Allow each expert to focus on those cases where it's already doing better than average.

Mixture of Experts. The array of p's is called a gating network. Optimize pᵢ as part of the loss function. [Diagram: the features feed each classifier; a gate weight pᵢ scales each expert's output.]

Probability of the correct answer under a mixture of experts: p(d^c | MoE) = Σᵢ pᵢ^c · (1/√(2π)) · exp(−(d^c − oᵢ^c)²/2), where pᵢ^c is the mixing coefficient for expert i on case c, d^c is the desired (probability of the desired) output on case c, and the exponential is a Gaussian loss between the desired and observed output. From Hinton's lecture.

Gating network gradient. With the loss E^c = −log p(d^c | MoE) = −log Σᵢ pᵢ^c e^{−½(d^c − oᵢ^c)²}, the gradient with respect to each expert's output is ∂E^c/∂oᵢ^c = −[pᵢ^c e^{−½(d^c − oᵢ^c)²} / Σⱼ pⱼ^c e^{−½(d^c − oⱼ^c)²}] · (d^c − oᵢ^c), where the bracketed ratio is the posterior probability of expert i. From Hinton's lecture.

AdaBoost: Adaptive Boosting. Construct an ensemble of weak classifiers, typically single-split decision trees (decision stumps). Identify weights for each classifier.

Weak Classifiers. Weak classifiers have low performance (slightly better than chance) and high variance, and (for AdaBoost) should have uncorrelated errors.

Boosting Hypothesis The existence of a weak learner implies the existence of a strong learner.

AdaBoost Decision Function: C(x) = α₁C₁(x) + α₂C₂(x) + … + α_k C_k(x). AdaBoost generates a prediction from a weighted sum of the predictions of each classifier. The AdaBoost algorithm determines the weights; the weight training is different from any loss function we've used. Similar to systems that use a second-tier classifier to learn a combination function.

AdaBoost training algorithm. Repeat: identify the best unused classifier Cᵢ; assign it a weight based on its performance; update the weight of each data point based on whether or not it is classified correctly. Until performance converges or all classifiers are included.

Identify the best classifier. Generate hypotheses using each unused classifier. Calculate the weighted error using the current data point weights (data point weights are initialized to one): W_e = Σ_{yᵢ ≠ C_m(xᵢ)} wᵢ^(m), i.e. how much weight fell on errors.

Generate a weight for the classifier. Error ratio: e_m = W_e / W, the weighted error over the total weight at this iteration. New classifier weight: α_m = ½ ln((1 − e_m)/e_m). The larger the reduction in error, the larger the classifier weight.

Data point weighting. If data point i was not correctly classified: wᵢ^(m+1) = wᵢ^(m) e^{α_m} = wᵢ^(m) √((1 − e_m)/e_m) > 1. If data point i was correctly classified: wᵢ^(m+1) = wᵢ^(m) e^{−α_m} = wᵢ^(m) √(e_m/(1 − e_m)) < 1.
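Putting the last few slides together, here is a compact AdaBoost sketch over decision stumps, using the slide's weight update (multiply by e^{α} = √((1−e)/e) on errors, e^{−α} on correct points); the 1-D data set and all names are invented:

```python
import math
from collections import namedtuple

Stump = namedtuple("Stump", "threshold low high")

def stump_predict(s, x):
    """Single-split decision tree with labels in {-1, +1}."""
    return s.low if x <= s.threshold else s.high

def best_stump(data, w):
    """Pick the stump with minimum weighted error under weights w."""
    best, best_err = None, float("inf")
    for t in sorted({x for x, _ in data}):
        for lo, hi in ((-1, 1), (1, -1)):
            s = Stump(t, lo, hi)
            err = sum(wi for (x, y), wi in zip(data, w)
                      if stump_predict(s, x) != y)
            if err < best_err:
                best, best_err = s, err
    return best, best_err

def adaboost(data, rounds=3):
    """Each round: fit the best stump, weight it by 0.5*ln((1-e)/e),
    then upweight misclassified points and downweight correct ones."""
    w = [1.0] * len(data)                 # slide convention: init to one
    ensemble = []
    for _ in range(rounds):
        stump, werr = best_stump(data, w)
        e = werr / sum(w)                 # normalised weighted error
        if e == 0 or e >= 0.5:
            break
        alpha = 0.5 * math.log((1 - e) / e)
        ensemble.append((alpha, stump))
        w = [wi * math.exp(-alpha * y * stump_predict(stump, x))
             for (x, y), wi in zip(data, w)]
    return ensemble

def ensemble_predict(ensemble, x):
    """Sign of the alpha-weighted sum of stump predictions."""
    total = sum(a * stump_predict(s, x) for a, s in ensemble)
    return 1 if total >= 0 else -1

# no single stump separates these labels, but three weighted stumps do
data_1d = [(1, 1), (2, 1), (3, -1), (4, -1), (5, 1), (6, 1)]
model = adaboost(data_1d)
print([ensemble_predict(model, x) for x, _ in data_1d])  # -> [1, 1, -1, -1, 1, 1]
```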

AdaBoost training algorithm (recap). Repeat: identify the best unused classifier Cᵢ; assign it a weight based on its performance; update the weight of each data point based on whether or not it is classified correctly. Until performance converges or all classifiers are included.

Random Forests. Random Forests are similar to an AdaBoost ensemble of decision trees, but without the adaptive training. An ensemble of classifiers is trained, each on a different subset of features and a different set of data points (random subspace projection).

Decision Tree. [Figure: a tree over the world state. Root: "is it raining?" If yes: P(wet) = 0.95. If no: "is the sprinkler on?" (yes: P(wet) = 0.9; no: P(wet) = 0.1).]

Construct a Forest of Trees. [Figure: trees t₁ … t_T each vote on category c.]

Training Algorithm. Divide the training data into K subsets of data points and M variables. Train a unique decision tree on each subset. Benefits: improved generalization, reduced memory requirements, and simple multi-threading.
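A toy sketch of the two sources of randomness (a bootstrap sample plus a random feature per member), with depth-1 "trees" standing in for full decision trees and an invented 2-D data set:

```python
import random
from collections import Counter

def train_stump(points, feature):
    """Fit a threshold stump on a single feature (a depth-1 'tree')."""
    best, best_err = None, float("inf")
    for t in sorted({p[feature] for p, _ in points}):
        for lo, hi in (("neg", "pos"), ("pos", "neg")):
            err = sum((lo if p[feature] <= t else hi) != y
                      for p, y in points)
            if err < best_err:
                best, best_err = (feature, t, lo, hi), err
    return best

def forest(data, n_trees=9, n_features=2):
    """Each member gets its own bootstrap sample AND a randomly chosen
    feature (a crude random subspace projection)."""
    trees = []
    for _ in range(n_trees):
        sample = [random.choice(data) for _ in data]
        feature = random.randrange(n_features)
        trees.append(train_stump(sample, feature))
    return trees

def forest_predict(trees, p):
    """Majority vote over all trees in the forest."""
    votes = [(lo if p[f] <= t else hi) for f, t, lo, hi in trees]
    return Counter(votes).most_common(1)[0][0]

random.seed(0)
# both features correlate with the label in this toy set
data_2d = [((0.1, 0.2), "neg"), ((0.2, 0.3), "neg"), ((0.3, 0.1), "neg"),
           ((0.7, 0.9), "pos"), ((0.8, 0.7), "pos"), ((0.9, 0.8), "pos")]
trees = forest(data_2d)
print(forest_predict(trees, (0.05, 0.05)), forest_predict(trees, (0.95, 0.95)))
```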

Handling Class Imbalances. Class imbalance, or skewed class distributions, happens when there are not equal numbers of each label. Class imbalance presents a number of challenges. Density estimation: low priors can lead to poor estimation of minority classes. Loss functions: since the loss of each point is equal, getting a lot of majority-class points correct dominates. Evaluation: accuracy is less informative.

Impact on Accuracy. Example from information retrieval: find 10 relevant documents in a set of 100. A classifier that hypothesizes Negative for everything gets TP = 0, FP = 0, FN = 10, TN = 90, yet Accuracy = 90%.

Contingency Table. Rows are hypothesized values (Positive/Negative); columns are true values (Positive/Negative). The cells are True Positive, False Positive, False Negative, True Negative, and Accuracy = (TP + TN) / (TP + FP + TN + FN).

F-Measure. Precision (how many hypothesized events were true events): P = TP / (TP + FP). Recall (how many of the true events were identified): R = TP / (TP + FN). F-Measure (harmonic mean of precision and recall): F = 2PR / (P + R). (Running example table: TP = 0, FP = 0, FN = 10, TN = 90.)

F-Measure. The F-measure can be weighted to favor precision or recall: F_β = (1 + β²)PR / (β²P + R). β > 1 favors recall; β < 1 favors precision.

F-Measure example: TP = 0, FP = 0, FN = 10, TN = 90 gives P = 0, R = 0, F₁ = 0.

F-Measure example: TP = 10, FP = 50, FN = 0, TN = 40 gives P = 10/60, R = 1, F₁ = .29.

F-Measure example: TP = 9, FP = 1, FN = 1, TN = 89 gives P = .9, R = .9, F₁ = .9.
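The three worked tables can be checked in a few lines (function names are mine, not from the slides):

```python
def precision(tp, fp):
    """P = TP / (TP + FP); defined as 0 when nothing is hypothesized positive."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """R = TP / (TP + FN); defined as 0 when there are no true positives."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(p, r, beta=1.0):
    """F_beta = (1 + beta^2) P R / (beta^2 P + R); beta=1 is the harmonic mean."""
    if p == 0 and r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)

# the three slides' contingency tables as (TP, FP, FN, TN)
for tp, fp, fn, tn in [(0, 0, 10, 90), (10, 50, 0, 40), (9, 1, 1, 89)]:
    p, r = precision(tp, fp), recall(tp, fn)
    print(round(p, 2), round(r, 2), round(f_measure(p, r), 2))
```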

ROC and AUC. It is common to plot classifier performance at a variety of settings or thresholds. Receiver Operating Characteristic (ROC) curves plot the true positive rate against the false positive rate. The overall performance is summarized by the Area Under the Curve (AUC).
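AUC can also be computed without plotting the curve, via the equivalent rank statistic; a small sketch (scores and labels invented):

```python
def auc(scores, labels):
    """Area under the ROC curve via the rank statistic: the probability
    that a randomly chosen positive scores higher than a randomly
    chosen negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0]))  # perfect ranking -> 1.0
print(auc([0.9, 0.8, 0.4, 0.3], [1, 0, 0, 1]))  # mixed ranking  -> 0.5
```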

Skew in Classifier Training. Most classifiers train better with balanced training data. Bayesian methods: reliance on a prior weights the classes unevenly, and estimation of the class-conditional density is impacted by skew in the number of samples. Loss functions: there is more pressure to set the decision boundary correctly for the majority classes.

Skew in Classifier Training

Skew in Classifier Training. [Figure: at the same distance from the optimal decision boundary, the skewed class makes twice as many errors.]

Sampling. Artificial manipulation of the number of training samples can help reduce the impact of class imbalance. Undersampling: randomly select N_m data points from the majority class for training.

Sampling. Oversampling: reproduce the minority-class points until the class sizes are balanced.

Ensemble Sampling. Repeat undersampling N_M/N_m times with different samples of the majority-class data points. Train N_M/N_m classifiers, and combine them with majority voting. [Diagram: classifiers C1, C2, C3 feed a Merge node.]
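A sketch of building the balanced training sets (one simple variant: shuffle the majority class and carve it into minority-sized chunks; all names invented):

```python
import random

def ensemble_undersample(majority, minority):
    """Split the majority class into N_M / N_m chunks; pair each chunk
    with the full minority class to build one balanced training set."""
    majority = list(majority)
    random.shuffle(majority)          # different sample per chunk
    size = len(minority)
    k = len(majority) // size         # N_M / N_m classifiers
    return [majority[i * size:(i + 1) * size] + list(minority)
            for i in range(k)]

random.seed(0)
majority = [("maj", i) for i in range(90)]
minority = [("min", i) for i in range(10)]
training_sets = ensemble_undersample(majority, minority)
print(len(training_sets), [len(s) for s in training_sets[:3]])  # -> 9 [20, 20, 20]
```

Each of the N_M/N_m balanced sets would train one classifier, and the trained classifiers are then merged by majority voting.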

Ensemble Methods. A very simple and effective technique for improving classification performance (Netflix Prize, Watson, etc.). Mathematical justification. Intuitive appeal: mirrors how decisions are made by people and organizations. Can allow for modular training.