Learning Ensembles. 293S T. Yang. UCSB, 2017.


Outline: Learning Ensembles, Random Forests, AdaBoost

Training data: restaurant example. Examples are described by attribute values (Boolean, discrete, continuous), e.g., situations where I will or won't wait for a table. The classification of each example is positive (T) or negative (F).

A decision tree to decide whether to wait: imagine someone taking a sequence of decisions.
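A minimal sketch of this setup (the attribute names and values below are made up for illustration, not the lecture's actual restaurant data): fit a small decision tree on a tiny table with a Boolean attribute, a continuous attribute, and a T/F label, using scikit-learn.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tiny, made-up table in the spirit of the restaurant example.
data = pd.DataFrame({
    "hungry":        [1, 1, 0, 1, 0, 1],              # Boolean attribute
    "wait_estimate": [10, 40, 10, 60, 30, 5],         # continuous attribute (minutes)
    "will_wait":     ["T", "T", "F", "F", "F", "T"],  # positive/negative classification
})

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(data[["hungry", "wait_estimate"]], data["will_wait"])

# Predict for a new situation: hungry, estimated wait of 15 minutes.
print(tree.predict(pd.DataFrame({"hungry": [1], "wait_estimate": [15]})))
```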

Learning Ensembles. Learn multiple classifiers separately, then combine their decisions (e.g., using weighted voting). When combining multiple decisions, random errors tend to cancel each other out while correct decisions are reinforced. [Diagram: Training Data → Data 1, Data 2, ..., Data m → Learner 1, Learner 2, ..., Learner m → Model 1, Model 2, ..., Model m → Model Combiner → Final Model]

Homogeneous Ensembles. Use a single, arbitrary learning algorithm but manipulate the training data so that it learns multiple models: Data 1 ≠ Data 2 ≠ ... ≠ Data m, while Learner 1 = Learner 2 = ... = Learner m. Methods for changing the training data: Bagging (resample the training data), Boosting (reweight the training data), DECORATE (add additional artificial training data).

Bagging. Create ensembles by repeatedly resampling the training data at random (Breiman, 1996). Given a training set of size n, create m sample sets by drawing n examples with replacement; each bootstrap sample set will on average contain 63.2% of the unique training examples, the rest being replicates. Combine the m resulting models using majority vote. Bagging decreases error by decreasing the variance due to unstable learners: algorithms (like decision trees) whose output can change dramatically when the training data is changed slightly.
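A minimal bagging sketch, assuming scikit-learn decision trees as the unstable base learner, NumPy arrays for X and y, and labels in {-1, +1}; this illustrates the idea above and is not the lecture's reference code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=25, rng=np.random.default_rng(0)):
    """Train n_models trees, each on a bootstrap sample (sampling n examples with replacement)."""
    models = []
    n = len(X)
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)   # bootstrap sample: ~63.2% unique examples
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the models by unweighted majority vote (labels assumed to be -1/+1)."""
    votes = np.sum([m.predict(X) for m in models], axis=0)
    return np.sign(votes)
```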

Random Forests. Introduce two sources of randomness: bagging and random input vectors. Each tree is grown using a bootstrap sample of the training data. At each node, the best split is chosen from a random sample of m variables instead of all M variables; m is held constant while the forest is grown. Each tree is grown to the largest extent possible. Bagging with decision trees is the special case of random forests with m = M.

Random Forests

Random Forest Algorithm. Good accuracy without over-fitting. Fast (can be faster than growing and pruning a single tree) and easily parallelized. Handles high-dimensional data without much trouble.
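For reference, a usage sketch with scikit-learn's RandomForestClassifier on a synthetic dataset; the parameter names are sklearn's, with max_features playing the role of the random subset of m out of the M input variables tried at each split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of bootstrap-sampled trees
    max_features="sqrt",   # m: random subset of the M features considered at each split
    bootstrap=True,        # each tree sees a bootstrap sample of the training data
    n_jobs=-1,             # trees are independent, so training parallelizes easily
    random_state=0,
).fit(X, y)

print(forest.score(X, y))
```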

Boosting: AdaBoost. Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, August 1997. Simple, with a solid theoretical foundation.

AdaBoost (Adaptive Boosting). Uses training-set re-weighting: each training sample carries a weight that determines its probability of being selected for a training set. AdaBoost constructs a strong classifier as a linear combination of simple weak classifiers; the final classification is based on a weighted sum of the weak classifiers' outputs.

AdaBoost: An Easy Flow. Starting from the original training set, training instances that are wrongly predicted by Learner 1 are weighted more heavily for Learner 2, and so on. [Diagram: Data set 1 → Learner 1, Data set 2 → Learner 2, ..., Data set T → Learner T, followed by a weighted combination of the learners.]

AdaBoost Terminology. $h_t(x)$: weak or basis classifier. $H(x) = \mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$: strong or final classifier. Weak classifier: less than 50% error over any distribution. Strong classifier: a thresholded linear combination of the weak classifiers' outputs.

And in a Picture. [Figure: a training case with a large weight in this round; a decision tree (weak learner) with a strong vote; a training case that is correctly classified.]

AdaBoost.M1. Given a training set $\{(x_1, y_1), \ldots, (x_m, y_m)\}$, where $y_i \in \{-1, +1\}$ is the correct label of instance $x_i \in X$. Initialize the distribution $D_1(i) = 1/m$ (the weights of the training cases). For $t = 1, \ldots, T$: find a weak classifier ("rule of thumb") $h_t: X \to \{-1, +1\}$ with small error $\varepsilon_t = \sum_{i: h_t(x_i) \neq y_i} D_t(i)$ on $D_t$; set $\alpha_t = \frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$; update the distribution on $\{1, \ldots, m\}$ as $D_{t+1}(i) = D_t(i)\,\exp(-\alpha_t\, y_i h_t(x_i)) / Z_t$, where $Z_t$ normalizes $D_{t+1}$ to sum to one. Note that $y_i h_t(x_i) > 0$ if $h_t$ is correct on $x_i$ and $y_i h_t(x_i) < 0$ if it is wrong, so correctly classified cases are down-weighted and misclassified cases are up-weighted. Output the final hypothesis $H(x) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$.
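A minimal NumPy sketch of the loop above, using one-feature threshold stumps as the weak learners and labels in {-1, +1}; it is an illustrative implementation under the conventions of these notes, not the original AdaBoost.M1 reference code.

```python
import numpy as np

def fit_stump(X, y, D):
    """Pick the (feature, threshold, sign) stump with the smallest weighted error under D."""
    best = None
    for j in range(X.shape[1]):                      # try each feature
        for thr in np.unique(X[:, j]):               # candidate thresholds
            for sign in (+1, -1):                    # which side predicts +1
                pred = np.where(X[:, j] < thr, sign, -sign)
                err = D[pred != y].sum()             # weighted error under D
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best                                      # (eps_t, feature, threshold, sign)

def adaboost_fit(X, y, T=10):
    """AdaBoost with threshold stumps; y is assumed to take values in {-1, +1}."""
    m = len(y)
    D = np.full(m, 1.0 / m)                          # D_1(i) = 1/m
    stumps, alphas = [], []
    for _ in range(T):
        eps, j, thr, sign = fit_stump(X, y, D)
        eps = float(np.clip(eps, 1e-10, 1 - 1e-10))  # guard against eps = 0 or 1
        alpha = 0.5 * np.log((1 - eps) / eps)
        pred = np.where(X[:, j] < thr, sign, -sign)
        D = D * np.exp(-alpha * y * pred)            # shrink correct, grow wrong weights
        D = D / D.sum()                              # normalize by Z_t
        stumps.append((j, thr, sign))
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """H(x) = sign(sum_t alpha_t h_t(x))."""
    f = sum(a * np.where(X[:, j] < thr, s, -s) for (j, thr, s), a in zip(stumps, alphas))
    return np.sign(f)
```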

Reweighting. [Figure: the multiplicative weight update $\exp(-\alpha_t\, y\, h_t(x))$ as a function of $y\, h_t(x)$; the weight shrinks when $y\, h_t(x) = 1$ (correct) and grows when $y\, h_t(x) = -1$ (wrong).]

Toy Example

Round 1. Weak classifier $h_1$: predict $+1$ if the feature value is below 0.2, else $-1$.

Round 2. Weak classifier $h_2$: predict $+1$ if the feature value is below 0.8, else $-1$.

Round 3. Weak classifier $h_3$: predict $+1$ if the feature value is above 0.7, else $-1$.

Final Combination. The final classifier combines the three stumps by weighted vote: $H(x) = \mathrm{sign}(\alpha_1 h_1(x) + \alpha_2 h_2(x) + \alpha_3 h_3(x))$, where $h_1$ predicts $+1$ below 0.2, $h_2$ predicts $+1$ below 0.8, and $h_3$ predicts $+1$ above 0.7 (each $-1$ otherwise); a small sketch of this combination follows.
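A small sketch of that weighted combination. The round weights a1, a2, a3 below are hypothetical placeholders (the lecture's actual α values appear only in its figures), and the assignment of coordinates to stumps is likewise an assumption.

```python
import numpy as np

a1, a2, a3 = 0.4, 0.6, 0.9   # placeholder round weights, NOT the lecture's values

def H(x1, x2):
    # Which coordinate each stump thresholds is an assumption (x1 for rounds 1-2, x2 for round 3).
    h1 = 1 if x1 < 0.2 else -1   # round-1 stump
    h2 = 1 if x1 < 0.8 else -1   # round-2 stump
    h3 = 1 if x2 > 0.7 else -1   # round-3 stump
    return int(np.sign(a1 * h1 + a2 * h2 + a3 * h3))

print(H(0.1, 0.5), H(0.9, 0.9))
```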

Pros and cons of AdaBoost. Advantages: very simple to implement; performs feature selection, resulting in a relatively simple classifier; fairly good generalization. Disadvantages: can yield a suboptimal solution; sensitive to noisy data and outliers.

References
Duda, Hart, et al. Pattern Classification.
Freund. An adaptive version of the boost by majority algorithm.
Freund, Schapire. Experiments with a new boosting algorithm.
Freund, Schapire. A decision-theoretic generalization of on-line learning and an application to boosting.
Friedman, Hastie, et al. Additive Logistic Regression: A Statistical View of Boosting.
Jin, Liu, et al. (CMU). A New Boosting Algorithm Using Input-Dependent Regularizer.
Li, Zhang, et al. FloatBoost Learning for Classification.
Opitz, Maclin. Popular Ensemble Methods: An Empirical Study.
Rätsch, Warmuth. Efficient Margin Maximization with Boosting.
Schapire, Freund, et al. Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods.
Schapire, Singer. Improved Boosting Algorithms Using Confidence-Rated Predictions.
Schapire. The Boosting Approach to Machine Learning: An Overview.
Zhang, Li, et al. Multi-view Face Detection with FloatBoost.

AdaBoost: Training Error Analysis. Unraveling the weight update gives
$$D_{T+1}(i) = \frac{\exp(-y_i f(x_i))}{m \prod_{t=1}^{T} Z_t}, \qquad \text{where } f(x) = \sum_{t=1}^{T} \alpha_t h_t(x).$$
Since $H(x_i) \neq y_i$ implies $y_i f(x_i) \le 0$ and hence $\exp(-y_i f(x_i)) \ge 1$, the training error satisfies
$$\frac{1}{m}\,\bigl|\{\, i : H(x_i) \neq y_i \,\}\bigr| \;\le\; \frac{1}{m} \sum_{i=1}^{m} \exp(-y_i f(x_i)).$$
Considering that $\sum_i D_{T+1}(i) = 1$, the right-hand side equals $\prod_{t=1}^{T} Z_t$. Finally,
$$\text{training error} \;\le\; \prod_{t=1}^{T} Z_t,$$
so making each $Z_t$ small (through the choice of $\alpha_t$ below) drives the training error down.

AdaBoost: How to choose $\alpha_t$. Since the training error is bounded by $\prod_t Z_t$, AdaBoost greedily minimizes $Z_t$ at each round:
$$\alpha_t = \arg\min_{\alpha} Z_t(\alpha) = \arg\min_{\alpha} \sum_{i=1}^{m} D_t(i)\, e^{-\alpha\, y_i h_t(x_i)}.$$
Let $u_i = y_i h_t(x_i)$; treating $u_i$ as a binary-valued variable in $\{-1, +1\}$, the sum splits into the correctly and incorrectly classified cases:
$$Z_t(\alpha) = \sum_{i:\, h_t(x_i) = y_i} D_t(i)\, e^{-\alpha} + \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)\, e^{\alpha} = (1 - \varepsilon_t)\, e^{-\alpha} + \varepsilon_t\, e^{\alpha},$$
where $\varepsilon_t = \sum_{i:\, h_t(x_i) \neq y_i} D_t(i)$. Setting $dZ_t/d\alpha = 0$ gives
$$\alpha_t = \frac{1}{2} \ln \frac{1 - \varepsilon_t}{\varepsilon_t} = \frac{1}{2} \ln \frac{1 + r_t}{1 - r_t}, \qquad \text{where } r_t = \sum_{i:\, h_t(x_i) = y_i} D_t(i) - \sum_{i:\, h_t(x_i) \neq y_i} D_t(i) = 1 - 2\varepsilon_t.$$
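A quick numerical check of this derivation for one example error rate (the value $\varepsilon_t = 0.2$ is an arbitrary choice): the $\alpha$ that minimizes $Z_t(\alpha)$ on a grid matches the closed form $\frac{1}{2}\ln\frac{1-\varepsilon_t}{\varepsilon_t}$.

```python
import numpy as np

eps = 0.2                                # example weighted error (arbitrary choice)
alphas = np.linspace(0.01, 2.0, 2000)
Z = (1 - eps) * np.exp(-alphas) + eps * np.exp(alphas)

print(alphas[np.argmin(Z)])              # numerical minimizer, ~0.693
print(0.5 * np.log((1 - eps) / eps))     # closed form: 0.5 * ln(4) ≈ 0.693
```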