Ensemble methods: Bagging and Boosting


Lecture 21: Ensemble methods: Bagging and Boosting
Milos Hauskrecht, milos@cs.pitt.edu, 5329 Sennott Square

Ensemble methods
Mixture of experts: multiple base models (classifiers, regressors), each covering a different part (region) of the input space.
Committee machines: multiple base models (classifiers, regressors), each covering the complete input space. Each base model is trained on a slightly different training set, and the predictions of all models are combined to produce the output.
Goal: improve the accuracy of the base model.
Methods: Bagging, Boosting, Stacking (not covered).

Bagging (Bootstrap Aggregating)
Given: a training set of N examples and a class of learning models (e.g., decision trees, neural networks, ...).
Method: train multiple (k) models on different samples (data splits) and average their predictions. Predict (test) by averaging the results of the k models.
Goal: improve the accuracy of one model by using its multiple copies. The average of the misclassification errors on different data splits gives a better estimate of the predictive ability of a learning method.

Bagging algorithm
Training: in each iteration t = 1, ..., T, randomly sample with replacement N samples from the training set and train a chosen base model (e.g., neural network, decision tree) on these samples.
Test: for each test example, run all T trained base models and predict by combining the results of all T models: averaging for regression, a majority vote for classification.
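A rough sketch of this train/vote loop (the function names and the choice of decision trees as base models are mine, not from the lecture; X and y are assumed to be NumPy arrays with integer class labels):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_train(X, y, k=25, seed=0):
    """Train k base models, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    N = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, N, size=N)   # sample N examples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Majority vote over the k models (use the mean instead for regression)."""
    votes = np.stack([m.predict(X) for m in models])          # shape (k, n_examples)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```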

[Figure: test examples are fed to H1, H2, H3; simple majority voting produces the final class (yes / no).]

Analysis of Bagging
Expected error = Bias + Variance.
Expected error is the expected discrepancy between the estimated and the true function:
E[ (f̂(X) − f(X))² ]
Bias is the squared discrepancy between the averaged estimate and the true function:
( E[f̂(X)] − f(X) )²
Variance is the expected divergence of the estimated function from its average value:
E[ (f̂(X) − E[f̂(X)])² ]
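Written together (a standard identity added here for completeness; the slide only lists the three terms), the decomposition for a fixed input X is:

```latex
\underbrace{E\!\left[(\hat{f}(X) - f(X))^2\right]}_{\text{expected error}}
= \underbrace{\left(E[\hat{f}(X)] - f(X)\right)^2}_{\text{bias}}
+ \underbrace{E\!\left[(\hat{f}(X) - E[\hat{f}(X)])^2\right]}_{\text{variance}}
```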

When does Bagging work? Under-fitting and over-fitting
Under-fitting: high bias (models are not accurate), small variance (smaller influence of the examples in the training set).
Over-fitting: small bias (models are flexible enough to fit the training data well), large variance (models depend very much on the training set).

Averaging decreases variance: example
Assume we measure a random variable x with a N(μ, σ²) distribution.
If only one measurement x_1 is done, the expected mean of the measurement is μ and the variance is Var(x_1) = σ².
If the random variable x is measured K times (x_1, x_2, ..., x_K) and the value is estimated as (x_1 + x_2 + ... + x_K)/K, the mean of the estimate is still μ, but the variance is smaller:
[Var(x_1) + ... + Var(x_K)] / K² = K σ² / K² = σ²/K
Observe: bagging is a kind of averaging!
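A quick simulation makes the σ²/K claim concrete (the values μ = 0, σ = 2, K = 10 are my own illustrative choices, not from the slide):

```python
import numpy as np

# A single measurement of x ~ N(mu, sigma^2) has variance sigma^2;
# the mean of K independent measurements has variance sigma^2 / K.
rng = np.random.default_rng(1)
mu, sigma, K, trials = 0.0, 2.0, 10, 200_000

x = rng.normal(mu, sigma, size=(trials, K))
print(x[:, 0].var())         # ~ sigma^2     = 4.0
print(x.mean(axis=1).var())  # ~ sigma^2 / K = 0.4
```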

When Bagging works
Main property of Bagging (proof omitted): Bagging decreases the variance of the base model without changing the bias! Why? Averaging!
Bagging typically helps when applied with an over-fitted base model (high dependency on the actual training data).
It does not help much with high bias, i.e., when the base model is robust to changes in the training data (due to sampling).

Boosting
Mixture of experts: one expert per region; expert switching.
Bagging: multiple models on the complete space; a learner is not biased to any region; learners are learned independently.
Boosting: every learner covers the complete space; learners are biased toward regions not predicted well by the other learners; learners are dependent.

Boosting: theoretical foundations
PAC: Probably Approximately Correct framework; an (ε, δ)-solution.
PAC learning: learning with pre-specified error (ε) and confidence (δ) parameters; the probability that the misclassification error is larger than ε is smaller than δ:
P( ME(c) > ε ) < δ
Accuracy (1 − ε): percentage of correctly classified samples in the test.
Confidence (1 − δ): the probability that in one experiment this accuracy will be achieved:
P( Acc(c) ≥ 1 − ε ) ≥ (1 − δ)

PAC learnability
Strong (PAC) learnability: there exists a learning algorithm that efficiently learns the classification with a pre-specified accuracy and confidence.
Strong (PAC) learner: a learning algorithm P that, given an arbitrary classification error ε (< 1/2) and confidence δ (< 1/2), outputs a classifier that satisfies these parameters; in other words, it gives classification accuracy > (1 − ε) with confidence probability > (1 − δ), and runs in time polynomial in 1/ε, 1/δ. This implies that the number of samples N is polynomial in 1/ε, 1/δ.

Weak Learner
A weak learner is a learning algorithm (learner) W that gives a classification accuracy > 1 − ε₀ with probability > 1 − δ₀, for some fixed and uncontrollable error ε₀ (< 1/2) and confidence δ₀ (< 1/2), and this on an arbitrary distribution of data entries.

Weak learnability = Strong (PAC) learnability
Assume there exists a weak learner: it is better than a random guess (> 50%), with confidence higher than 50%, on any data distribution.
Question: is the problem then also PAC-learnable? Can we generate an algorithm P that achieves an arbitrary (ε, δ) accuracy?
Why is this important? The usual classification methods (decision trees, neural nets) have specified but uncontrollable performance. Can we improve performance to achieve any pre-specified accuracy (confidence)?

Weak = Strong learnability!!!
Proof due to R. Schapire: an arbitrary (ε, δ) improvement is possible.
Idea: combine multiple weak learners together. Given a weak learner W with confidence δ₀ and maximal error ε₀, it is possible to improve (boost) the confidence and to improve (boost) the accuracy by training different weak learners on slightly different datasets.

Boosting accuracy
[Figure: samples from the training distribution are used to train learners H1, H2, H3; regions are marked "correct classification", "wrong classification", and "H1 and H2 classify differently".]

Boosting accuracy
Training:
Sample randomly from the distribution of examples and train hypothesis H1 on the sample.
Evaluate the accuracy of H1 on the distribution. Sample randomly such that H1 gives correct results on half of the samples and incorrect results on the other half; train hypothesis H2 on this sample.
Train H3 on samples from the distribution on which H1 and H2 classify differently.
Test: for each example, decide according to the majority vote of H1, H2 and H3.

Theorem: if each hypothesis has an error < ε₀, the final voting classifier has error < g(ε₀) = 3ε₀² − 2ε₀³.
Accuracy improved! Apply recursively to get to the target accuracy.
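As a quick numerical check of the theorem (the value ε₀ = 0.3 is my own example, not from the slide):

```latex
g(0.3) = 3(0.3)^2 - 2(0.3)^3 = 0.27 - 0.054 = 0.216 < 0.3
```

More generally, g(ε₀) − ε₀ = −ε₀(1 − ε₀)(1 − 2ε₀) < 0 whenever 0 < ε₀ < 1/2, so each application of the construction strictly reduces the error, which is why the recursion reaches any target accuracy.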

Theoretical Boosting algorithm
Similarly to boosting the accuracy, we can boost the confidence at some restricted accuracy cost.
The key result: we can improve both the accuracy and the confidence.
Problems with the theoretical algorithm: it requires a good (better than 50%) classifier on all distributions and problems; we cannot get a good sample from the data distribution; the method requires a large training set.
Solution to the sampling problem: boosting by sampling, i.e., the AdaBoost algorithm and its variants.

AdaBoost
AdaBoost: boosting by sampling.
Classification (Freund, Schapire; 1996): AdaBoost.M1 (two-class problem), AdaBoost.M2 (multiple-class problem).
Regression (Drucker; 1997): AdaBoostR.

AdaBoost
Given: a training set of N examples (attribute + class label pairs) and a base learning model (e.g., a decision tree, a neural network).
Training stage: train a sequence of T base models on T different sampling distributions defined upon the training set (D). The sampling distribution D_t for building the model at step t is constructed by modifying the sampling distribution D_{t-1} from the (t−1)th step: examples classified incorrectly in the previous step receive higher weights in the new data (this attempts to cover the misclassified samples).
Application (classification) stage: classify according to the weighted majority of the classifiers.

AdaBoost training
[Figure: the training data with distribution D1 is used to learn Model 1 and test it, producing Errors 1; distribution D2 yields Model 2 and Errors 2; and so on up to distribution DT, Model T and Errors T.]

AdaBoost algorithm
Training (step t):
Sampling distribution D_t: D_t(i) is the probability that example i from the original training dataset is selected; D_1(i) = 1/N for the first step (t = 1).
Take K samples from the training set according to D_t.
Train a classifier h_t on the samples.
Calculate the error ε_t of h_t:
ε_t = Σ_{i: h_t(x_i) ≠ y_i} D_t(i)
Classifier weight: β_t = ε_t / (1 − ε_t).
New sampling distribution:
D_{t+1}(i) = ( D_t(i) / Z_t ) × { β_t if h_t(x_i) = y_i, 1 otherwise }
where Z_t is the normalization constant.

AdaBoost: sampling probabilities
Example: a nonlinearly separable binary classification problem, with neural networks as weak learners.
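A compact sketch of this training loop (my own naming; decision stumps stand in for the weak learner, K is taken equal to N, and labels are assumed to be integer classes 0/1):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=50, seed=0):
    """AdaBoost.M1-style training by sampling: returns the weak models h_t and their beta_t."""
    rng = np.random.default_rng(seed)
    N = len(X)
    D = np.full(N, 1.0 / N)                            # D_1(i) = 1/N
    models, betas = [], []
    for t in range(T):
        idx = rng.choice(N, size=N, replace=True, p=D) # draw samples according to D_t
        h = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
        wrong = h.predict(X) != y
        eps = D[wrong].sum()                           # epsilon_t: weight of misclassified examples
        if eps <= 0 or eps >= 0.5:                     # weak-learner assumption violated; stop
            break
        beta = eps / (1.0 - eps)                       # beta_t = eps_t / (1 - eps_t)
        D = D * np.where(wrong, 1.0, beta)             # keep weight on errors, shrink correct ones
        D = D / D.sum()                                # normalize (the Z_t constant)
        models.append(h)
        betas.append(beta)
    return models, betas
```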

AdaBoost: sampling probabilities
[Figure: sampling probabilities over the training examples.]

AdaBoost classification
We have T different classifiers h_t. The weight w_t of a classifier is proportional to its accuracy on the training set:
w_t = log(1/β_t) = log( (1 − ε_t) / ε_t ), where β_t = ε_t / (1 − ε_t)
Classification: for every class j = 0, 1, compute the sum of the weights w_t corresponding to ALL classifiers that predict class j; output the class that corresponds to the maximal sum of weights (weighted majority):
h_final(x) = argmax_j Σ_{t: h_t(x) = j} w_t
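The corresponding weighted-majority prediction, continuing the sketch above (again with assumed names and two classes j = 0, 1):

```python
import numpy as np

def adaboost_predict(models, betas, X):
    """h_final(x) = argmax_j of the sum of w_t over classifiers h_t that predict class j."""
    w = np.log(1.0 / np.asarray(betas))          # w_t = log(1 / beta_t)
    scores = np.zeros((len(X), 2))               # per-class sums of classifier weights
    for h, w_t in zip(models, w):
        pred = h.predict(X).astype(int)
        scores[np.arange(len(X)), pred] += w_t   # add w_t to the class h_t predicts
    return scores.argmax(axis=1)                 # weighted-majority class
```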

Two-class example: classification
Classifier 1: yes, weight 0.7. Classifier 2: no, weight 0.3. Classifier 3: no, weight 0.2.
Weighted majority: yes receives 0.7, no receives 0.5, so the margin for yes is 0.7 − 0.5 = +0.2. The final choice is yes.

What is boosting doing?
Each classifier specializes on a particular subset of examples; the algorithm concentrates on more and more difficult examples.
Boosting can reduce variance (the same as Bagging), but it can also eliminate the effect of the high bias of the weak learner (unlike Bagging).
Train versus test error performance: training errors can be driven close to 0, but the test errors do not show overfitting.
Proofs and theoretical explanations can be found in a number of papers.

Boosting: error performance
[Figure: training error, test error, and single-learner error curves.]

Model Averaging
An alternative way to combine multiple models; it can be used in supervised and unsupervised frameworks.
For example, the likelihood of the data can be expressed by averaging over the multiple models:
P(D) = Σ_{i=1}^{N} P(D | M_i) P(M_i)
Prediction:
P(y | x) = Σ_{i=1}^{N} P(y | x, M_i) P(M_i)
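A minimal sketch of the prediction rule (function and variable names are mine; per_model_probs holds P(y | x, M_i) for each model, and model_weights holds P(M_i)):

```python
import numpy as np

def averaged_prediction(per_model_probs, model_weights):
    """P(y | x) = sum_i P(y | x, M_i) * P(M_i)."""
    per_model_probs = np.asarray(per_model_probs)  # shape (n_models, n_classes)
    model_weights = np.asarray(model_weights)      # shape (n_models,), sums to 1
    return model_weights @ per_model_probs         # shape (n_classes,)

# Example: three models, two classes, equal model priors.
print(averaged_prediction([[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]], [1/3, 1/3, 1/3]))
```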