ECE 5424: Introduction to Machine Learning

ECE 5424: Introduction to Machine Learning
Topics: Ensemble Methods (Bagging, Boosting), PAC Learning
Readings: Murphy 16.4; Hastie 16
Stefan Lee, Virginia Tech

Fighting the bias-variance tradeoff
Simple (a.k.a. weak) learners, e.g., naïve Bayes, logistic regression, decision stumps (or shallow decision trees):
Good: low variance, don't usually overfit.
Bad: high bias, can't solve hard learning problems.
Sophisticated learners, e.g., kernel SVMs, deep neural nets, deep decision trees:
Good: low bias, have the potential to learn with Big Data.
Bad: high variance, difficult to generalize.
Can we combine these properties? In general, no! But often, yes.
(C) Dhruv Batra. Slide Credit: Carlos Guestrin

Ensemble Methods
Core intuition: a combination of multiple classifiers will perform better than a single classifier.
Two approaches: Bagging and Boosting.
(C) Stefan Lee

Ensemble Methods
Instead of learning a single predictor, learn many predictors.
Output class: (weighted) combination of each predictor.
With sophisticated learners (Bagging): uncorrelated errors → expected error goes down; on average, do better than a single classifier!
With weak learners (Boosting), each one good at a different part of the input space: on average, do better than a single classifier!
(C) Dhruv Batra

Bagging (Bootstrap Aggregating / Bootstrap Averaging)
Core idea: average multiple strong learners trained on resamples of your data to reduce variance and overfitting!
(C) Stefan Lee

Bagging
Given: a dataset of N training examples (x_i, y_i).
Sample N training points with replacement and train a predictor; repeat M times:
Sample 1 → h_1(x), ..., Sample M → h_M(x).
At test time, output the (weighted) average of these predictors' outputs.
(C) Stefan Lee
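A minimal sketch of the bagging procedure above, assuming scikit-learn-style base learners with fit/predict; the helper names are illustrative, not from the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # any base learner with fit/predict works

def bagging_fit(X, y, M=25, make_base=lambda: DecisionTreeRegressor()):
    """Train M predictors, each on a bootstrap resample of (X, y)."""
    N = len(X)
    models = []
    for _ in range(M):
        idx = np.random.choice(N, size=N, replace=True)  # sample N points with replacement
        model = make_base()
        model.fit(X[idx], y[idx])
        models.append(model)
    return models

def bagging_predict(models, X):
    """Average the predictions of the bagged models (regression case)."""
    return np.mean([m.predict(X) for m in models], axis=0)
```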

Why Use Bagging
Let e_m be the error of the m-th predictor trained through bagging and e_avg be the error of the ensemble.
If E[e_m] = 0 (unbiased) and E[e_m e_k] = E[e_m] E[e_k] for m ≠ k (uncorrelated), then:

E[e_avg^2] = (1/M) · ( (1/M) Σ_{m=1..M} E[e_m^2] )

The expected error of the average is 1/M times the average expected error of the individual predictors!
(C) Stefan Lee
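A quick numerical check of that identity under the idealized assumptions (zero-mean, pairwise-uncorrelated errors, compared in squared error). The simulation below is illustrative only and not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
M, trials = 10, 100_000

# Simulate zero-mean, uncorrelated per-model errors e_m for many test points.
e = rng.normal(loc=0.0, scale=1.0, size=(trials, M))

avg_individual_sq_error = np.mean(e ** 2)          # average E[e_m^2] over the M models
ensemble_sq_error = np.mean(e.mean(axis=1) ** 2)   # E[e_avg^2] of the averaged predictor

print(avg_individual_sq_error)  # ~1.0
print(ensemble_sq_error)        # ~1.0 / M = 0.1
```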

When To Use Bagging
In practice, completely uncorrelated predictors don't really happen, but there won't likely be perfect correlation either, so bagging may still help!
Use bagging when:
you have overfit, sophisticated learners (averaging lowers variance),
you have a reasonably sized dataset,
you want an extra bit of performance from your models.
(C) Stefan Lee

Example: Decision Forests
We've seen that single decision trees can easily overfit!
Train M trees on different samples of the data and call it a forest.
Uncorrelated errors result in better ensemble performance. Can we force this?
Could assign trees random max depths.
Could give each tree only a random subset of the splits.
Some work optimizes for low correlation as part of the objective!
(C) Stefan Lee
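For concreteness, a minimal random-forest sketch using scikit-learn (assumed available; not part of the slides). Random feature subsetting at each split is what helps decorrelate the trees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# max_features controls the random subset of features considered at each split,
# which encourages less-correlated trees; n_estimators is the forest size M.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_tr, y_tr)
print("test accuracy:", forest.score(X_te, y_te))
```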

A Note From Statistics
Bagging is a general method to reduce/estimate the variance of an estimator.
Looking at the distribution of an estimator over multiple resamples of the data can give confidence intervals and bounds on that estimator.
In this context it is typically just called bootstrapping.
(C) Stefan Lee
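A small illustration of that statistical use of the bootstrap: a percentile confidence interval for a sample mean. Everything here is a generic example, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.exponential(scale=2.0, size=200)  # some observed sample

# Re-estimate the mean on B bootstrap resamples of the data.
B = 5000
boot_means = np.array([
    rng.choice(data, size=len(data), replace=True).mean() for _ in range(B)
])

lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean = {data.mean():.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```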

Boosting
Core idea: combine multiple weak learners to reduce error/bias by reweighting hard examples!
(C) Stefan Lee

Some Intuition About Boosting
Consider a weak learner h(x), for example a decision stump that splits on a single feature x_j at a threshold t:
h(x) = a if x_j < t, and h(x) = b if x_j >= t.
Example for binary classification: h(x) = -1 on one side of the threshold t and h(x) = +1 on the other.
Example for regression: a different linear model on each side, e.g. h(x) = w_1^T x if x_j < t and h(x) = w_2^T x otherwise.
(The original slide illustrates these stumps as figures over the x_j axis.)
(C) Stefan Lee
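To make the weak learner concrete, here is a brute-force decision stump for binary classification with weighted examples. This is only a sketch, not the course's reference implementation:

```python
import numpy as np

def fit_stump(X, y, w=None):
    """Find the (feature j, threshold t, sign) stump minimizing weighted 0/1 error.
    Labels y are in {-1, +1}; w are example weights (uniform if None)."""
    N, D = X.shape
    w = np.ones(N) / N if w is None else w / w.sum()
    best = (np.inf, 0, 0.0, 1)  # (error, feature j, threshold t, sign)
    for j in range(D):
        for t in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] < t, -1, 1)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, t, sign)
    return best

def stump_predict(stump, X):
    _, j, t, sign = stump
    return sign * np.where(X[:, j] < t, -1, 1)
```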

Some Intuition About Boosting
A weak learner like a decision stump will make mistakes often, but what if we combine multiple such learners to combat these errors, so that our final predictor is:

f(x) = α_1 h_1(x) + α_2 h_2(x) + ... + α_{T-1} h_{T-1}(x) + α_T h_T(x)

This is a big optimization problem now:

min over α_1,...,α_T and h_1,...,h_T ∈ H of (1/N) Σ_i L(y_i, α_1 h_1(x_i) + α_2 h_2(x_i) + ... + α_T h_T(x_i))

Boosting does this greedily, training one classifier at a time to correct the errors of the existing ensemble.
(C) Stefan Lee

Boosting Algorithm [Schapire, 1989]
Pick a class of weak learners H = {h | h : X → Y}.
You have a black box that picks the best weak learner under either an unweighted or a weighted sum of losses:

h* = argmin_{h ∈ H} Σ_i L(y_i, h(x_i))   (unweighted)
h* = argmin_{h ∈ H} Σ_i w_i L(y_i, h(x_i))   (weighted)

On each iteration t:
Compute the error of the current ensemble f_{t-1}(x_i) = Σ_{t'=1..t-1} α_{t'} h_{t'}(x_i).
Update the weight of each training example based on its error.
Learn a predictor h_t and a strength α_t for this predictor.
Update the ensemble: f_t(x) = f_{t-1}(x) + α_t h_t(x).
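A schematic of that greedy loop in code. The weak-learner oracle and the rules for α_t and the example weights depend on the loss (see the table of boosting variants later), so they are passed in as placeholder callbacks here; the structure is the point, not the specific rules:

```python
import numpy as np

def boost(X, y, T, fit_weak, choose_alpha, update_weights):
    """Generic greedy boosting loop.
    fit_weak(X, y, w) -> weak model with .predict(X)
    choose_alpha(y, pred, w) -> strength alpha_t
    update_weights(w, y, pred, alpha) -> new example weights
    The three callbacks are loss-specific (e.g. AdaBoost's exponential-loss rules)."""
    N = len(X)
    w = np.ones(N) / N
    ensemble = []  # list of (alpha_t, h_t)
    for t in range(T):
        h_t = fit_weak(X, y, w)              # best weak learner under the weighted loss
        pred = h_t.predict(X)
        alpha_t = choose_alpha(y, pred, w)   # strength of this predictor
        w = update_weights(w, y, pred, alpha_t)  # focus on examples the ensemble gets wrong
        ensemble.append((alpha_t, h_t))
    return ensemble

def ensemble_predict(ensemble, X):
    """f_T(x) = sum_t alpha_t * h_t(x)."""
    return sum(a * h.predict(X) for a, h in ensemble)
```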

Boosting Demo
Matlab demo by Antonio Torralba: http://people.csail.mit.edu/torralba/shortcourserloc/boosting/boosting.html
(C) Dhruv Batra

Boosting: Weak to Strong
(Figure: training error vs. size of the boosted ensemble M; training error decreases toward zero.)
As we add more boosted learners to our ensemble, the error approaches zero (in the limit).
Need to decide when to stop based on a validation set.
Don't use this on already-overfit strong learners; it will just become worse.
(C) Stefan Lee

Boosting Algorithm [Schapire, 1989] (slide repeated)
Pick a class of weak learners H = {h | h : X → Y}; a black box returns the best weak learner under an unweighted or weighted sum of losses. On each iteration t: compute the error of the current ensemble f_{t-1}(x_i) = Σ_{t'=1..t-1} α_{t'} h_{t'}(x_i), update the weight of each training example based on its error, learn a predictor h_t and its strength α_t, and update the ensemble f_t(x) = f_{t-1}(x) + α_t h_t(x).

Boosting Algorithm [Schapire, 1989]
We've assumed we have tools to find optimal learners from either unweighted data {(x_i, y_i)}_{i=1..N} or weighted data {(x_i, y_i, w_i)}_{i=1..N}:

h* = argmin_h (1/N) Σ_i L(y_i, h(x_i))   or   h* = argmin_h (1/N) Σ_i w_i L(y_i, h(x_i))

To train the t-th predictor, our job is to express the optimization for the new predictor in one of these forms:

min over α_t, h_t of (1/N) Σ_i L(y_i, f_{t-1}(x_i) + α_t h_t(x_i))

This is typically done by changing either y_i or w_i, depending on L.
(C) Stefan Lee

Types of Boosting

Loss Name | Loss Formula | Boosting Name
Regression: squared loss | (y - f(x))^2 | L2Boosting
Regression: absolute loss | |y - f(x)| | Gradient Boosting
Classification: exponential loss | e^{-y f(x)} | AdaBoost
Classification: log/logistic loss | log(1 + e^{-y f(x)}) | LogitBoost

(C) Dhruv Batra

L2 Boosting
Loss: regression squared loss (y - f(x))^2, i.e. L2Boosting.
Algorithm: on the board.
(C) Dhruv Batra
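Since the slide derives the algorithm on the board, the following is only a hedged sketch of the usual L2Boosting recipe: each round fits the next weak learner to the current residuals y - f_{t-1}(x). It uses a depth-1 regression tree as the weak learner and, for simplicity, fixes the step size α_t to a small constant lr rather than solving for it:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def l2_boost(X, y, T=100, lr=0.1):
    """L2Boosting sketch: greedily fit each weak learner to the residuals
    r_i = y_i - f_{t-1}(x_i), which is what minimizing squared loss reduces to."""
    f = np.zeros(len(y))  # current ensemble predictions f_{t-1}(x_i)
    learners = []
    for t in range(T):
        residuals = y - f
        h_t = DecisionTreeRegressor(max_depth=1).fit(X, residuals)  # regression stump
        f = f + lr * h_t.predict(X)  # alpha_t fixed to a small constant step here
        learners.append(h_t)
    return learners

def l2_boost_predict(learners, X, lr=0.1):
    return lr * sum(h.predict(X) for h in learners)
```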

AdaBoost
Loss: classification exponential loss e^{-y f(x)}, i.e. AdaBoost.
Algorithm: you will derive it in HW4!
(C) Dhruv Batra
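Without giving the HW4 derivation away, the commonly stated AdaBoost update rules look like the sketch below, assuming labels in {-1, +1} and depth-1 trees as weak learners. This is an illustrative implementation, not the course's:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, T=50):
    """AdaBoost sketch for labels y in {-1, +1}.
    Standard rules: alpha_t = 0.5*log((1-err)/err), w_i <- w_i*exp(-alpha_t*y_i*h_t(x_i))."""
    N = len(y)
    w = np.ones(N) / N
    ensemble = []
    for t in range(T):
        h_t = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h_t.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-12, 1 - 1e-12)  # weighted training error of h_t
        alpha = 0.5 * np.log((1 - err) / err)                 # strength of this weak learner
        w = w * np.exp(-alpha * y * pred)                     # upweight misclassified examples
        w /= w.sum()
        ensemble.append((alpha, h_t))
    return ensemble

def adaboost_predict(ensemble, X):
    return np.sign(sum(a * h.predict(X) for a, h in ensemble))
```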

What you should know
Voting/ensemble methods.
Bagging: how to sample; under what conditions error is reduced.
Boosting: the general algorithm; the L2Boosting derivation; the AdaBoost derivation (from HW4).
(C) Dhruv Batra

Learning Theory
Probably Approximately Correct (PAC) Learning
What does it formally mean to learn?
(C) Dhruv Batra

Learning Theory
We have explored many ways of learning from data. But:
How good is our classifier, really?
How much data do I need to make it good enough?
(C) Dhruv Batra. Slide Credit: Carlos Guestrin

A simple setting
Classification with N data points.
Finite space H of possible hypotheses, e.g., decision trees on categorical variables of depth d.
A learner finds a hypothesis h that is consistent with the training data, i.e., gets zero training error: error_train(h) = 0.
What is the probability that h has more than ε true error? P(error_true(h) ≥ ε)
(C) Dhruv Batra. Slide Credit: Carlos Guestrin

Generalization error in finite hypothesis spaces [Haussler '88]
Theorem: for a finite hypothesis space H, a dataset D with N i.i.d. samples, and 0 < ε < 1, for any learned hypothesis h that is consistent (zero training error) on the training data:

P(error_true(h) > ε) ≤ |H| e^{-N ε}

Even if h makes zero errors on training data, it may make errors at test time.
(C) Dhruv Batra. Slide Credit: Carlos Guestrin

Using a PAC bound
Let the maximum acceptable failure probability be δ: set P(error_true(h) > ε) = δ.
Typically, two use cases:
1. Pick ε and δ, get N: "I want no more than ε error with probability at least 1 - δ; how much data do I need?"
2. Pick N and δ, get ε: "I have N data points and want to know my error ε with confidence 1 - δ."
(C) Dhruv Batra. Slide Credit: Carlos Guestrin
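Rearranging the consistent-learner bound from the previous slide, |H| e^{-Nε} ≤ δ, gives both use cases directly. A quick numeric sketch; the values of |H|, δ, ε, and N below are made up for illustration:

```python
import math

H_size = 2 ** 20  # illustrative |H|: a large but finite hypothesis class
delta = 0.05      # acceptable failure probability
eps = 0.01        # acceptable true error

# Use case 1: pick eps and delta, solve |H| * exp(-N*eps) <= delta for N.
N_needed = (math.log(H_size) + math.log(1 / delta)) / eps
print("N needed:", math.ceil(N_needed))

# Use case 2: pick N and delta, solve for eps.
N = 50_000
eps_achieved = (math.log(H_size) + math.log(1 / delta)) / N
print("eps achievable with confidence 1 - delta:", eps_achieved)
```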

Haussler '88 bound
Strengths:
Holds for all (finite) H.
Holds for all data distributions.
Weaknesses:
Requires a consistent classifier (zero training error).
Requires a finite hypothesis space.
(C) Dhruv Batra. Slide Credit: Carlos Guestrin

Generalization bound for |H| hypotheses
Theorem: for a finite hypothesis space H, a dataset D with N i.i.d. samples, and 0 < ε < 1, for any learned hypothesis h (not necessarily consistent):

P(error_true(h) - error_train(h) > ε) ≤ |H| e^{-2 N ε^2}

(C) Dhruv Batra. Slide Credit: Carlos Guestrin

PAC bound and the bias-variance tradeoff
Setting the bound equal to δ and moving some terms around, with probability at least 1 - δ:

error_true(h) ≤ error_train(h) + sqrt( (ln|H| + ln(1/δ)) / (2N) )

A larger |H| typically lowers error_train(h) (less bias) but enlarges the square-root term (more variance); a smaller |H| does the opposite.
Important: the PAC bound holds for all h, but it doesn't guarantee that the algorithm finds the best h!
(C) Dhruv Batra. Slide Credit: Carlos Guestrin
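To see how the square-root term behaves, a tiny calculation of that gap for a few hypothesis-space sizes (the numbers are illustrative only):

```python
import math

def pac_gap(H_size, N, delta):
    """Gap term sqrt((ln|H| + ln(1/delta)) / (2N)) from the bound above."""
    return math.sqrt((math.log(H_size) + math.log(1 / delta)) / (2 * N))

# A bigger hypothesis space needs more data to keep the same gap.
for H_size in (2 ** 10, 2 ** 20, 2 ** 30):
    print(H_size, round(pac_gap(H_size, N=10_000, delta=0.05), 4))
```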