Bootstrap Aggregating (Bagging)


Bootstrap Aggregating (Bagging)
An ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms
Can be used in both regression and classification
Reduces variance and helps to avoid overfitting
Usually applied to decision trees, though it can be used with any type of method

An Aside: Ensemble Methods
In a nutshell: a combination of multiple learning algorithms with the goal of achieving better predictive performance than could be obtained from any of these classifiers alone
A meta-algorithm that can be considered to be, in itself, a supervised learning algorithm since it produces a single hypothesis
Ensembles tend to work better when there is diversity among the models
Examples: Bagging, Boosting, Bucket of models, Stacking

An Aside: Ensemble Methods
(Diagram) Traditional: a single training set S is given to one learner L_1, which outputs a single hypothesis h_1; a query (x, ?) is answered as (x, y = h_1(x)).
Ensemble method: different training sets and/or learning algorithms L_1, ..., L_6 produce hypotheses h_1, ..., h_6, combined into h = f(h_1, ..., h_6); a query (x, ?) is answered using the combined hypothesis.

Back to Bagging
The idea:
1. Create N bootstrap samples {S_1, ..., S_N} of S as follows: for each S_i, randomly draw |S| examples from S with replacement
2. For each i = 1, ..., N: h_i = Learn(S_i)
3. Output H = <h_1, ..., h_N, majorityVote>
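To make the three steps concrete, here is a minimal Python sketch of this bagging loop. The names (bagging_fit, bagging_predict) and the use of scikit-learn's DecisionTreeClassifier as the base learner Learn() are illustrative assumptions, not part of the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed base learner; any Learn() would do

def bagging_fit(X, y, n_estimators=25, random_state=0):
    """Train N hypotheses, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(random_state)
    n = len(X)
    hypotheses = []
    for _ in range(n_estimators):
        # Step 1: draw |S| indices from S with replacement (a bootstrap sample S_i)
        idx = rng.integers(0, n, size=n)
        # Step 2: h_i = Learn(S_i)
        hypotheses.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return hypotheses

def bagging_predict(hypotheses, X):
    """Step 3: combine h_1, ..., h_N by majority vote (labels assumed to be nonnegative ints)."""
    votes = np.stack([h.predict(X) for h in hypotheses])   # shape (N, n_samples)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```

With labels encoded as nonnegative integers, bagging_predict(bagging_fit(X_train, y_train), X_test) returns the majority-vote predictions.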

Most Notable Benefits
1. Surprisingly competitive performance; rarely overfits
2. Is capable of reducing the variance of its constituent models
3. Improves the ability to ignore irrelevant features
Remember: error(x) = noise(x) + bias(x) + variance(x)
Variance: how much does the prediction change if we change the training set?
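One way to see the variance-reduction claim empirically is to retrain on different training sets and measure how much the predictions on fixed test points change. The sketch below is illustrative only: the dataset is synthetic, and prediction_variance, the subset sizes, and the use of scikit-learn's BaggingClassifier are my own choices, not from the slides.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=0)
X_test = X[:200]                                   # fixed query points

def prediction_variance(make_model, n_repeats=20):
    """Retrain on different random training subsets and measure how much
    the predicted labels on the same test points change across retrainings."""
    rng = np.random.default_rng(0)
    preds = []
    for _ in range(n_repeats):
        idx = rng.choice(len(X), size=500, replace=False)   # a different training set each time
        preds.append(make_model().fit(X[idx], y[idx]).predict(X_test))
    return np.mean(np.var(np.stack(preds), axis=0))          # label variance, averaged over test points

print(prediction_variance(lambda: DecisionTreeClassifier()))                     # single tree
print(prediction_variance(lambda: BaggingClassifier(DecisionTreeClassifier())))  # bagged trees
```

In most runs the bagged ensemble shows noticeably lower prediction variance than the single tree, which is the sense of "variance" used on this slide.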

Bagging Example 1 (figure)

Bagging Example 2 (figure)

Bagging Example 3 (figure, part 1)

Bagging Example 3 (figure, part 2): Accuracy: 100%

How does bagging minimize error?
The ensemble reduces the overall variance.
Let f(x) be the target value of x, h_1 to h_n be the set of base hypotheses, and h_avg be the average of the base hypotheses' predictions.
Squared error: Error(h_i, x) = (f(x) - h_i(x))^2
Is there any relation between h_avg and the variance? Yes.

How does bagging minimize error?
Error(h_i, x) = (f(x) - h_i(x))^2
Error(h_avg, x) = (1/n) Σ_{i=1}^{n} Error(h_i, x) - (1/n) Σ_{i=1}^{n} (h_i(x) - h_avg(x))^2
By the above, we see that the squared error of the average hypothesis equals the average squared error of the base hypotheses minus the variance of the base hypotheses.
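The slide states this identity without proof; here is a short derivation (a standard decomposition argument, not taken verbatim from the slides), writing h_avg(x) = (1/n) Σ_i h_i(x):

```latex
\begin{aligned}
\frac{1}{n}\sum_{i=1}^{n} \mathrm{Error}(h_i, x)
  &= \frac{1}{n}\sum_{i=1}^{n} \bigl(f(x) - h_i(x)\bigr)^2 \\
  &= \frac{1}{n}\sum_{i=1}^{n} \bigl[\bigl(f(x) - h_{avg}(x)\bigr) + \bigl(h_{avg}(x) - h_i(x)\bigr)\bigr]^2 \\
  &= \bigl(f(x) - h_{avg}(x)\bigr)^2
     + \frac{1}{n}\sum_{i=1}^{n}\bigl(h_i(x) - h_{avg}(x)\bigr)^2 ,
\end{aligned}
```

where the cross term vanishes because the deviations h_avg(x) - h_i(x) average to zero. Rearranging gives exactly the equation above: the error of the average hypothesis is the average error minus the variance of the base hypotheses, so the more the base hypotheses disagree, the larger the gain.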

Stability of Learn
A learning algorithm is unstable if small changes in the training data can produce large changes in the output hypothesis (otherwise it is stable)
Clearly, bagging will have little benefit when used with stable base learning algorithms (i.e., most ensemble members will be very similar)
Bagging generally works best when used with unstable yet relatively accurate base learners

Bagging Summary
Works well if the base classifiers are unstable (complement each other)
Increased accuracy because it reduces the variance of the individual classifiers
Does not focus on any particular instance of the training data
Therefore, it is less susceptible to model overfitting when applied to noisy data

Boosting
Key differences with respect to bagging:
It is iterative:
Bagging: each individual classifier is independent
Boosting: looks at the errors from previous classifiers to decide what to focus on in the next iteration
Successive classifiers depend on their predecessors
Key idea: place more weight on hard examples (i.e., instances that were misclassified in previous iterations)

Historical Notes
The idea of boosting began with a learning theory question first asked in the late 1980s
The question was answered in 1989 by Robert Schapire, resulting in the first theoretical boosting algorithm
Schapire and Freund later developed a practical boosting algorithm called AdaBoost
Many empirical studies show that AdaBoost is highly effective (the resulting ensembles very often outperform ensembles produced by bagging)

Boosting
An iterative procedure to adaptively change the distribution of the training data by focusing more on previously misclassified records
Initially, all N records are assigned equal weights
Unlike bagging, weights may change at the end of a boosting round
Different implementations vary in terms of (1) how the weights of the training examples are updated and (2) how the predictions are combined

Boosting
Records that are wrongly classified will have their weights increased
Records that are classified correctly will have their weights decreased

Original data:      1   2   3   4   5   6   7   8   9  10
Boosting (Round 1): 7   3   2   8   7   9   4  10   6   3
Boosting (Round 2): 5   4   9   4   2   5   1   7   4   2
Boosting (Round 3): 4   4   8  10   4   5   4   6   3   4

Example 4 is hard to classify
Its weight is increased, and therefore it is more likely to be chosen again in subsequent rounds
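The round-by-round samples above come from drawing records with replacement in proportion to their current weights. A minimal sketch of that sampling step (the function name resample_by_weight is my own, not from the slides; indices are 0-based, whereas the table uses 1-based record IDs):

```python
import numpy as np

def resample_by_weight(n_records, weights, rng):
    """Draw n_records indices with replacement, each record chosen with
    probability proportional to its current boosting weight."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()                      # normalize weights into sampling probabilities
    return rng.choice(n_records, size=n_records, replace=True, p=p)

rng = np.random.default_rng(0)
weights = np.ones(10) / 10               # round 1: all 10 records weighted equally
print(resample_by_weight(10, weights, rng))   # a bootstrap-style sample like the rows above
```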

Boosting
Equal weights are assigned to each training instance (1/N) in the first round
After a classifier C_i is learned, the weights are adjusted to allow the subsequent classifier C_{i+1} to pay more attention to data that were misclassified by C_i
The final boosted classifier C* combines the votes of each individual classifier
The weight of each classifier's vote is a function of its accuracy
AdaBoost is a popular boosting algorithm

AdaBoost (Adaptive Boosting)
Input:
A training set D containing N instances
T rounds
A classification learning scheme
Output: a composite model

AdaBoost: Training Phase
The training data D contain N labeled examples: (X_1, y_1), (X_2, y_2), (X_3, y_3), ..., (X_N, y_N)
Initially, assign an equal weight of 1/N to each example
To generate T base classifiers, we need T rounds or iterations
Round i: data from D are sampled with replacement to form D_i (of size N)
Each example's chance of being selected in the next rounds depends on its weight
Each time, the new sample is generated directly from the training data D with different sampling probabilities according to the weights; these weights are never zero

AdaBoost: Training Phase
The base classifier C_i is derived from the training data D_i
The error of C_i is tested using D_i
The weights of the training data are adjusted depending on how they were classified:
Correctly classified: decrease weight
Incorrectly classified: increase weight
The weight of an example indicates how hard it is to classify (directly proportional)

AdaBoost: Testing Phase
The lower a classifier's error rate, the more accurate it is, and therefore the higher its weight for voting should be
The weight of classifier C_i's vote is α_i = (1/2) ln((1 - ε_i) / ε_i)
Testing: for each class c, sum the weights of every classifier that assigned class c to x_test (unseen data)
The class with the highest sum is the WINNER:
C*(x_test) = argmax_y Σ_{i=1}^{T} α_i δ(C_i(x_test) = y), where δ(·) is 1 if its argument is true and 0 otherwise

Example: Error and Classifier Weight in AdaBoost
Base classifiers: C_1, C_2, ..., C_T
Error rate (i = index of classifier, j = index of instance):
ε_i = (1/N) Σ_{j=1}^{N} w_j δ(C_i(x_j) ≠ y_j)
Importance of a classifier:
α_i = (1/2) ln((1 - ε_i) / ε_i)
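As a quick sanity check on this formula (my own numbers, not from the slides): a classifier that misclassifies a weighted fraction ε_i = 0.1 of the data gets vote weight

```latex
\alpha_i = \tfrac{1}{2}\ln\frac{1-\varepsilon_i}{\varepsilon_i}
         = \tfrac{1}{2}\ln\frac{0.9}{0.1}
         = \tfrac{1}{2}\ln 9 \approx 1.10 ,
```

while a classifier that is no better than random guessing (ε_i = 0.5) gets α_i = (1/2) ln 1 = 0, i.e., no say in the vote at all.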

Example: Data Instance Weight in AdaBoost
Assume: N training data in D, T rounds, (x_j, y_j) are the training data, and C_i, α_i are the classifier and weight of the i-th round, respectively
Weight update on all training data in D:
w_j^(i+1) = (w_j^(i) / Z_i) * exp(-α_i) if C_i(x_j) = y_j
w_j^(i+1) = (w_j^(i) / Z_i) * exp(α_i)  if C_i(x_j) ≠ y_j
where Z_i is the normalization factor (so that the new weights sum to 1)
Final classifier, as before: C*(x_test) = argmax_y Σ_{i=1}^{T} α_i δ(C_i(x_test) = y)
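Putting the training and testing slides together, here is a compact AdaBoost sketch following these formulas. It is a rough illustration rather than the exact scheme on the slides: the names adaboost_fit/adaboost_predict are my own, labels are assumed to be in {-1, +1}, and scikit-learn decision stumps stand in for the generic classification learning scheme.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # assumed base learner (decision stumps)

def adaboost_fit(X, y, T=20, random_state=0):
    """Train up to T weighted base classifiers; y is assumed to be +/-1."""
    rng = np.random.default_rng(random_state)
    N = len(X)
    w = np.full(N, 1.0 / N)              # round 1: equal weights 1/N
    classifiers, alphas = [], []
    for _ in range(T):
        # Sample D_i from D with replacement according to the current weights
        idx = rng.choice(N, size=N, replace=True, p=w)
        C = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        # Weighted error rate eps_i of C_i on the full training set
        # (the slide's 1/N factor is absorbed because the weights here sum to 1)
        miss = (C.predict(X) != y)
        eps = float(np.sum(w * miss))
        if eps >= 0.5:                   # worse than chance: skip this round (a common safeguard)
            continue
        eps = max(eps, 1e-10)            # avoid log of infinity when eps is exactly 0
        alpha = 0.5 * np.log((1 - eps) / eps)
        # Weight update: increase misclassified, decrease correct, then normalize by Z_i
        w = w * np.exp(np.where(miss, alpha, -alpha))
        w = w / w.sum()
        classifiers.append(C)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(classifiers, alphas, X):
    """Weighted vote: sign of the alpha-weighted sum of the +/-1 predictions."""
    scores = sum(a * C.predict(X) for C, a in zip(classifiers, alphas))
    return np.sign(scores)
```

The weighted-vote in adaboost_predict is the two-class special case of the argmax formula above: summing α_i over classifiers that vote +1 versus those that vote -1 and taking the larger side.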

Illustrating AdaBoost (figures)
The figures show ten one-dimensional training points (labelled + / -) being reweighted over three boosting rounds:
Initial weight for each data point: 0.1
Boosting round 1 (classifier B1): point weights (by group) 0.0094, 0.0094, 0.4623; α = 1.9459
Boosting round 2 (classifier B2): point weights (by group) 0.3037, 0.0009, 0.0422; α = 2.9323
Boosting round 3 (classifier B3): point weights (by group) 0.0276, 0.1819, 0.0038; α = 3.8744
The final row of the figure shows the overall (combined) classification of the ten points.

Bagging and Boosting Summary
Bagging:
o Resample data points
o The weight of each classifier is the same
o Only variance reduction
o Robust to noise and outliers
Boosting:
o Reweight data points (modify the data distribution)
o Classifier weights vary depending on accuracy
o Reduces both bias and variance
o Can hurt performance with noise and outliers