Boosting with decision stumps and binary features

Jason Rennie
jrennie@ai.mit.edu

April 10, 2003

1 Introduction

A special case of boosting is when features are binary and the base learner is a decision stump (which assigns examples a label based on a single feature). Text classification problems tend to have the property that the presence of a feature is more useful than its absence. But in this special case, an algorithm like AdaBoost will assign weights of equal magnitude to both positive and negative occurrences. This makes the path to convergence less direct and may cause AdaBoost to select features poorly. The alternative is to use a base learner that can abstain, but then we lose the ability to easily weight non-occurrences. Solutions include allowing the base learner to weight occurrences and non-occurrences separately, and adjusting the bias term to compensate. We discuss such algorithms, pointing out their advantages and disadvantages.

2 AdaBoost

In this section, we give the derivation of AdaBoost for the case of binary features and a decision stump base learner. Let $(x_1, \ldots, x_m)$ be the set of training examples, $x_i \in X$. Let $(y_1, \ldots, y_m)$ be their labels, $y_i \in \{-1, +1\}$. Let $x_{ij} = +1$ if feature $j$ is present in example $i$, $x_{ij} = -1$ otherwise. Assume that the classifier learned by AdaBoost up to round $n$ is

    f(x) = w_0 + \sum_{j=1}^{d} w_j x_j,    (1)

where $d$ is the number of features and $x_j \in \{-1, +1\}$ is the value of feature $j$ for the test example $x$.
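To make this setup concrete, here is a minimal sketch in Python (our own illustration, not code from the paper; the array names presence, X, y, w0, and w are assumptions) of the ±1 feature encoding and the additive classifier of equation (1). The later snippets reuse these names.

    import numpy as np

    # Toy data: m = 4 examples, d = 3 binary features (1 = present, 0 = absent).
    presence = np.array([[1, 0, 1],
                         [0, 1, 1],
                         [1, 1, 0],
                         [0, 0, 1]])

    # Encode presence/absence as +1/-1, as in Section 2.
    X = 2 * presence - 1            # x_ij in {-1, +1}
    y = np.array([+1, -1, +1, -1])  # labels y_i in {-1, +1}

    # Classifier of equation (1): f(x) = w_0 + sum_j w_j x_j.
    w0 = 0.0
    w = np.zeros(X.shape[1])

    def f_values(X, w0, w):
        return w0 + X @ w           # f(x_i) for every training example

    print(f_values(X, w0, w))       # all zeros before any boosting rounds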

We want to modify the weight on one of the features so as to maximally decrease the loss functional,

    L(f) = \sum_i e^{-y_i f(x_i)}.    (2)

If we modify the weight of the $k$th feature, the loss becomes

    L_k(f) = \sum_i e^{-y_i f(x_i) - y_i \delta x_{ik}},    (3)

where $\delta$ is the amount we add to $w_k$. Minimizing with respect to $\delta$ gives us

    \frac{\partial L_k}{\partial \delta} = -\sum_i y_i x_{ik} e^{-y_i f(x_i) - y_i \delta x_{ik}} = 0    (4)

    e^{-\delta} \sum_{i : y_i x_{ik} = 1} e^{-y_i f(x_i)} = e^{\delta} \sum_{i : y_i x_{ik} = -1} e^{-y_i f(x_i)}    (5)

    \delta = \frac{1}{2} \log \frac{\sum_{i : y_i x_{ik} = 1} e^{-y_i f(x_i)}}{\sum_{i : y_i x_{ik} = -1} e^{-y_i f(x_i)}}.    (6)

Let $W^+ = \sum_{i : y_i x_{ik} = 1} e^{-y_i f(x_i)}$ and $W^- = \sum_{i : y_i x_{ik} = -1} e^{-y_i f(x_i)}$. Then the feature that maximally reduces the loss is the feature $k$ that maximizes

    L(f) - L_k(f) = \sum_i e^{-y_i f(x_i)} - \sum_i e^{-y_i f(x_i) - y_i \delta x_{ik}}    (7)

    = (1 - e^{-\delta}) \sum_{i : y_i x_{ik} = 1} e^{-y_i f(x_i)} + (1 - e^{\delta}) \sum_{i : y_i x_{ik} = -1} e^{-y_i f(x_i)}    (8)

    = \left( \sqrt{W^+} - \sqrt{W^-} \right)^2.    (9)

Equivalently, since $W^+ + W^- = L(f)$ is the same for every feature, the feature that maximally reduces the loss is the one whose ratio $W^+/W^-$ is farthest from one (the one that maximizes $W^+/W^-$ when $W^+ \geq W^-$).
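As an illustration of this selection rule, the following sketch (our own, reusing the names from the snippet above; it is not the paper's code) performs one round: it computes $W^+$ and $W^-$ for every feature, picks the feature maximizing $(\sqrt{W^+} - \sqrt{W^-})^2$, and adds $\delta$ to its weight.

    import numpy as np

    def adaboost_round(X, y, w0, w):
        """One round with x_ij in {-1,+1}: choose a feature k and update w_k by delta."""
        losses = np.exp(-y * (w0 + X @ w))             # e^{-y_i f(x_i)} for each example
        agree = (y[:, None] * X) == 1                  # indicator of y_i x_ij = +1
        W_plus = np.where(agree, losses[:, None], 0.0).sum(axis=0)    # W^+ per feature
        W_minus = np.where(~agree, losses[:, None], 0.0).sum(axis=0)  # W^- per feature
        reduction = (np.sqrt(W_plus) - np.sqrt(W_minus)) ** 2         # equation (9)
        k = int(np.argmax(reduction))
        # Equation (6); assumes W^- > 0 for the chosen feature (true unless that
        # stump already classifies every example correctly).
        delta = 0.5 * np.log(W_plus[k] / W_minus[k])
        w = w.copy()
        w[k] += delta
        return w0, w, k, delta

Calling w0, w, k, delta = adaboost_round(X, y, w0, w) repeatedly builds up the weight vector of equation (1).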

3 ZeroOneBoost

In the special case that we have described, AdaBoost has the undesirable property that a non-occurrence of a feature contributes a score of the same magnitude as an occurrence does. We consider a slight variant of AdaBoost wherein the decision stump returns +1 if the chosen feature occurs and 0 if it does not. We use the same notation as before, except that now $x_{ij} = 0$ if feature $j$ is not present in example $i$. Again, for each feature, we want to choose $\delta$ to minimize the loss. We arrive at the same formula for $\delta$,

    \delta = \frac{1}{2} \log \frac{W^+}{W^-}.    (10)

But it is subtly different from before. Whereas before the two sets, $\{i : y_i x_{ik} = 1\}$ and $\{i : y_i x_{ik} = -1\}$, enumerated all possible values of $i$, they now enumerate only those for which feature $k$ appears ($x_{ik} = 1$). We must now pay attention to the fact that $W^+$ or $W^-$ can easily be zero. Smoothing is used to avoid taking the log of zero or infinity. We compute $\delta$ as

    \delta = \frac{1}{2} \log \frac{W^+ + \epsilon}{W^- + \epsilon},    (11)

where $\epsilon$ is some small constant. To determine the feature that maximally reduces the loss, we compute

    L(f) - L_k(f).    (12)

Again, the result is identical to what we found before, with the subtle difference that the two sets of examples that $W^+$ and $W^-$ cover do not include the entire set of examples.
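The same step for the abstaining stump looks like the following sketch (our own code under the same naming assumptions; X01 holds the raw 0/1 features and eps plays the role of $\epsilon$ in equation (11)):

    import numpy as np

    def zero_one_round(X01, y, w0, w, eps=1e-3):
        """One round with x_ij in {0,+1}: the stump abstains when the feature is absent."""
        losses = np.exp(-y * (w0 + X01 @ w))           # e^{-y_i f(x_i)}
        occurs = X01 == 1
        pos = occurs & (y[:, None] == 1)               # y_i x_ik = +1 (present, y_i = +1)
        neg = occurs & (y[:, None] == -1)              # y_i x_ik = -1 (present, y_i = -1)
        W_plus = np.where(pos, losses[:, None], 0.0).sum(axis=0)
        W_minus = np.where(neg, losses[:, None], 0.0).sum(axis=0)
        delta_all = 0.5 * np.log((W_plus + eps) / (W_minus + eps))   # equation (11)
        # Loss reduction as in equation (9), now over occurring examples only
        # (computed with the unsmoothed delta).
        reduction = (np.sqrt(W_plus) - np.sqrt(W_minus)) ** 2
        k = int(np.argmax(reduction))
        w = w.copy()
        w[k] += delta_all[k]
        return w0, w, k, delta_all[k]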

4 Bias Correction

In both of the above-described algorithms, weights are chosen for the features, but in the first case the weights are the same for occurrences and non-occurrences, and in the second case a weight of zero is always used for non-occurrences. Usually, improvements can be made by adding a bias value. Consider AdaBoost where we have just chosen a feature, $k$, and weight, $\delta$. Now we wish to choose a bias, $\gamma$, so as to minimize the loss functional. The loss is

    L(f) = \sum_i e^{-y_i f(x_i) - y_i \gamma}.    (13)

Solving for the minimum, we get

    \frac{\partial L}{\partial \gamma} = -\sum_i y_i e^{-y_i f(x_i) - y_i \gamma} = 0    (14)

    \gamma = \frac{1}{2} \log \frac{\sum_{i : y_i = 1} e^{-y_i f(x_i)}}{\sum_{i : y_i = -1} e^{-y_i f(x_i)}},    (15)

or, in the notation used above (with the two sums now taken over $\{i : y_i = 1\}$ and $\{i : y_i = -1\}$), $\gamma = \frac{1}{2} \log \frac{W^+}{W^-}$. This is the same whether we are using $-1/+1$ features or $0/+1$ features, and we do not need to smooth.

5 Separate occurrence and non-occurrence weights

Schapire and Singer discuss learning weights for the occurrence and non-occurrence of features separately [1]. This is similar to selecting a bias. But whereas we have described a process of first selecting a weight and then choosing a bias, here the weights are chosen simultaneously, albeit in separate optimizations (the weights are not chosen to jointly minimize the loss functional). Let $x_{ik} \in \{0, +1\}$. We conduct two separate minimizations. In the first, we choose $\delta^+$, the weight for the feature's occurrence. Then we choose $\delta^-$, the weight for the feature's non-occurrence. The formulas are identical to those already derived above. For $\delta^+$, we use $0/+1$ features, so smoothing is required. For $\delta^-$, we use $+1/0$ features; that is, $x_{ij} = 0$ when feature $j$ occurs and $x_{ij} = +1$ when it does not. Each of the two weights yields a separate reduction in the loss function. We choose the feature whose sum of the two loss reductions is largest.
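A sketch of the bias-correction step of Section 4 (again our own code and names; it assumes the chosen feature's weight $\delta$ has already been added to w, so the sums below correspond to equation (15)):

    import numpy as np

    def bias_correction(X, y, w0, w):
        """Fold the loss-minimizing bias gamma into w_0 after a feature update."""
        losses = np.exp(-y * (w0 + X @ w))       # e^{-y_i f(x_i)}, with f including delta
        B_plus = losses[y == 1].sum()            # sum over examples with y_i = +1
        B_minus = losses[y == -1].sum()          # sum over examples with y_i = -1
        gamma = 0.5 * np.log(B_plus / B_minus)   # equation (15); no smoothing needed
        return w0 + gamma, w

In the same spirit, the separate-weights scheme of Section 5 could reuse zero_one_round twice, once on X01 and once on 1 - X01, summing the two loss reductions when ranking features.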

6 NewtonBoost

None of the above algorithms finds the weight and bias term that jointly minimize the loss functional. The closest are the algorithms that first choose a weight and then do bias correction. An algorithm that is uniformly better than the one Schapire and Singer describe is one that picks a weight for occurrence or non-occurrence (depending on which gives the lower loss) and then does bias correction. Equivalently, one could choose one weight and then choose the second weight in the context of the first.

In this section, we explore the idea of jointly optimizing the occurrence and non-occurrence weights, or jointly optimizing the weight and the bias term. For each feature, $k$, consider selecting a weight, $\delta$, and bias, $\gamma$. The new loss is

    L_k(f) = \sum_i e^{-y_i f(x_i)} e^{-y_i \delta x_{ik} - y_i \gamma}.    (16)

Minimizing for $\gamma$ and $\delta$, we get

    \frac{\partial L_k}{\partial \gamma} = -\sum_i y_i e^{-y_i f(x_i)} e^{-y_i \delta x_{ik}} e^{-y_i \gamma} = 0    (17)

    \gamma = \frac{1}{2} \log \frac{\sum_{i : y_i = 1} e^{-y_i f(x_i)} e^{-y_i \delta x_{ik}}}{\sum_{i : y_i = -1} e^{-y_i f(x_i)} e^{-y_i \delta x_{ik}}}    (18)

    \frac{\partial L_k}{\partial \delta} = -\sum_i y_i x_{ik} e^{-y_i f(x_i)} e^{-y_i \delta x_{ik}} e^{-y_i \gamma} = 0    (19)

    \delta = \frac{1}{2} \log \frac{\sum_{i : y_i x_{ik} = 1} e^{-y_i f(x_i)} e^{-y_i \gamma}}{\sum_{i : y_i x_{ik} = -1} e^{-y_i f(x_i)} e^{-y_i \gamma}}.    (20)

Now assume that $x_{ij} \in \{0, +1\}$. Then

    \delta = -\gamma + \frac{1}{2} \log \frac{W^+}{W^-}.    (21)

Now define $B^+ = \sum_{i : y_i = 1} e^{-y_i f(x_i)}$ and $B^- = \sum_{i : y_i = -1} e^{-y_i f(x_i)}$. Recall that $W^+ = \sum_{i : y_i x_{ik} = 1} e^{-y_i f(x_i)}$ and $W^- = \sum_{i : y_i x_{ik} = -1} e^{-y_i f(x_i)}$. We simplify $\gamma$ to

    \gamma = \frac{1}{2} \log \frac{B^+ + (e^{-\delta} - 1) W^+}{B^- + (e^{\delta} - 1) W^-}.    (22)

Define $N^+ = \sum_{i : y_i = 1, x_{ik} = 0} e^{-y_i f(x_i)}$ and $N^- = \sum_{i : y_i = -1, x_{ik} = 0} e^{-y_i f(x_i)}$. Substituting for $\delta$, we get

    \gamma = \frac{1}{2} \log \frac{N^+ + e^{\gamma} \sqrt{W^+ W^-}}{N^- + e^{-\gamma} \sqrt{W^+ W^-}}.    (23)

We cannot solve for $\gamma$ directly. Instead, we use Newton's method. Let $y = f(\gamma)$, where

    f(\gamma) = \gamma - \frac{1}{2} \log \frac{N^+ + e^{\gamma} \sqrt{W^+ W^-}}{N^- + e^{-\gamma} \sqrt{W^+ W^-}}.    (24)
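The fixed-point function of equation (24) can be assembled from the per-feature sums as in this sketch (our own code and names, assuming 0/+1 features as in the text):

    import numpy as np

    def newton_objective(X01, y, w0, w, k):
        """Return f(gamma) of equation (24) for feature k, plus the sums it is built from."""
        losses = np.exp(-y * (w0 + X01 @ w))            # e^{-y_i f(x_i)}
        occurs = X01[:, k] == 1
        W_plus = losses[occurs & (y == 1)].sum()        # y_i x_ik = +1
        W_minus = losses[occurs & (y == -1)].sum()      # y_i x_ik = -1
        N_plus = losses[~occurs & (y == 1)].sum()       # y_i = +1, x_ik = 0
        N_minus = losses[~occurs & (y == -1)].sum()     # y_i = -1, x_ik = 0
        s = np.sqrt(W_plus * W_minus)                   # sqrt(W^+ W^-)

        def f_gamma(gamma):                             # equation (24)
            return gamma - 0.5 * np.log((N_plus + np.exp(gamma) * s) /
                                        (N_minus + np.exp(-gamma) * s))

        return f_gamma, (W_plus, W_minus, N_plus, N_minus, s)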

The derivative is

    f'(\gamma) = 1 - \frac{1}{2} \left( \frac{e^{\gamma} \sqrt{W^+ W^-}}{N^+ + e^{\gamma} \sqrt{W^+ W^-}} + \frac{e^{-\gamma} \sqrt{W^+ W^-}}{N^- + e^{-\gamma} \sqrt{W^+ W^-}} \right).    (25)

Newton's method calculates the $(n+1)$st approximation to $\gamma$ as

    \gamma_{n+1} = \gamma_n - \frac{f(\gamma_n)}{f'(\gamma_n)}.    (26)

We have not been able to show that Newton's method will converge. It can be shown that $f$ has one inflection point and that if we add $\epsilon$ to $N^+$ and $N^-$, then $f'(\gamma) > \frac{\epsilon}{1 + \epsilon}$. Visual inspection of example plots (see Figure 1) over a normal range of values shows cases that will converge.

Figure 1: Example plots of $f(\gamma)$.

We can solve for the bias term, $\gamma$, via Newton's method, then calculate the weight, $\delta$, as a function of $\gamma$. We make this calculation for each feature and choose the one that yields the largest decrease in the loss functional,

    L(f) - L_k(f) = \sum_i e^{-y_i f(x_i)} \left( 1 - e^{-y_i x_{ik} \delta} e^{-y_i \gamma} \right).    (27)
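Continuing the sketch above (our own code; a fixed iteration count stands in for a real convergence test, and both $W^+$ and $W^-$ are assumed nonzero), the Newton iteration and the exact loss decrease of equation (27) are:

    import numpy as np

    def newton_boost_feature(X01, y, w0, w, k, n_iter=20):
        """Solve for gamma by Newton's method, then compute delta and the loss decrease."""
        f_gamma, (W_plus, W_minus, N_plus, N_minus, s) = newton_objective(X01, y, w0, w, k)

        def f_prime(gamma):                              # equation (25)
            a = np.exp(gamma) * s
            b = np.exp(-gamma) * s
            return 1.0 - 0.5 * (a / (N_plus + a) + b / (N_minus + b))

        gamma = 0.0
        for _ in range(n_iter):                          # equation (26)
            gamma -= f_gamma(gamma) / f_prime(gamma)

        delta = -gamma + 0.5 * np.log(W_plus / W_minus)  # equation (21)

        # Exact loss decrease of equation (27) for this feature's (delta, gamma) pair.
        losses = np.exp(-y * (w0 + X01 @ w))
        decrease = (losses * (1.0 - np.exp(-y * X01[:, k] * delta - y * gamma))).sum()
        return gamma, delta, decrease

Running newton_boost_feature over every feature and keeping the one with the largest decrease gives the selection rule described here; $w_k$ and $w_0$ are then updated by $\delta$ and $\gamma$.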

This procedure has an advantage over the others described in this paper: it optimizes the weight and bias term jointly, so a feature is chosen based on the real decrease in the loss functional that it can provide, whereas the other techniques use an approximation of the loss delta. However, it is not clear that this provides a significant improvement over the other techniques. Schapire and Singer noted that learning separate occurrence and non-occurrence weights yielded little improvement over learning only occurrence weights. This technique may likewise give little additional improvement.

References

[1] Robert E. Schapire and Yoram Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39:135–168, 2000.