STK-IN4300 Statistical Learning Methods in Data Science

Riccardo De Bin
debin@math.uio.no
STK-IN4300: lecture 10

Outline of the lecture

- AdaBoost
  - introduction
  - algorithm
- Statistical Boosting
  - boosting as a forward stagewise additive modelling
  - why exponential loss?
  - gradient boosting

AdaBoost: introduction

L. Breiman: "[Boosting is] the best off-the-shelf classifier in the world."

- originally developed for classification:
  - as a pure machine-learning black box;
- translated into the statistical world (Friedman et al., 2000);
- extended to every statistical problem (Mayr et al., 2014):
  - regression;
  - survival analysis;
  - ...
- interpretable models, thanks to the statistical view;
- extended to work in high-dimensional settings (Bühlmann, 2006).

AdaBoost: introduction

Starting challenge: "Can [a committee of blockheads] somehow arrive at highly reasoned decisions, despite the weak judgement of the individual members?" (Schapire & Freund, 2014)

Goal: create a good classifier by combining several weak classifiers;
- in classification, a weak classifier is a classifier which produces results only slightly better than a random guess.

Idea: apply a weak classifier repeatedly (iteratively) to modifications of the data;
- at each iteration, give more weight to the misclassified observations.

AdaBoost: algorithm

Consider a two-class classification problem, $y_i \in \{-1, 1\}$, $x_i \in \mathbb{R}^p$.

AdaBoost algorithm:
1. initialize the weights, $w^{[0]} = (1/N, \dots, 1/N)$;
2. for $m$ from 1 to $m_{\text{stop}}$:
   (a) fit the weak estimator $G(\cdot)$ to the weighted data;
   (b) compute the weighted in-sample misclassification rate,
       $\text{err}^{[m]} = \sum_{i=1}^N w_i^{[m-1]} 1\big(y_i \neq \hat{G}^{[m]}(x_i)\big)$;
   (c) compute the voting weight, $\alpha^{[m]} = \log\big((1 - \text{err}^{[m]})/\text{err}^{[m]}\big)$;
   (d) update the weights,
       $w_i = w_i^{[m-1]} \exp\big\{\alpha^{[m]} 1\big(y_i \neq \hat{G}^{[m]}(x_i)\big)\big\}$,
       $w_i^{[m]} = w_i / \sum_{i=1}^N w_i$;
3. compute the final result,
   $\hat{G}_{\text{AdaBoost}}(x) = \text{sign}\Big(\sum_{m=1}^{m_{\text{stop}}} \alpha^{[m]} \hat{G}^{[m]}(x)\Big)$.

First iteration of a toy example with N = 10 observations: apply the classifier $G(\cdot)$ to the observations with weights

    w_i: 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10 0.10

- observations 1, 2 and 3 are misclassified $\Rightarrow$ $\text{err}^{[1]} = 0.3$;
- compute $\alpha^{[1]} = 0.5 \log\big((1 - \text{err}^{[1]})/\text{err}^{[1]}\big) \approx 0.42$;
- set $w_i = w_i \exp\big\{\alpha^{[1]} 1\big(y_i \neq \hat{G}^{[1]}(x_i)\big)\big\}$:

    w_i: 0.15 0.15 0.15 0.07 0.07 0.07 0.07 0.07 0.07 0.07
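Before continuing with the toy example, here is a minimal numpy sketch of the full algorithm. The decision-stump weak classifier and all function names are illustrative choices of mine, not part of the lecture; the slides leave $G(\cdot)$ generic.

```python
import numpy as np

def fit_stump(X, y, w):
    """Weighted decision stump: threshold one feature, predict -1/+1."""
    n, p = X.shape
    best = (np.inf, None)                        # (weighted error, stump parameters)
    for j in range(p):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] <= thr, 1, -1)
                err = np.sum(w * (pred != y))
                if err < best[0]:
                    best = (err, (j, thr, sign))
    return best[1]

def predict_stump(stump, X):
    j, thr, sign = stump
    return sign * np.where(X[:, j] <= thr, 1, -1)

def adaboost(X, y, m_stop):
    """AdaBoost as in the algorithm box above; y must take values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                      # step 1: w_i^[0] = 1/N
    stumps, alphas = [], []
    for m in range(m_stop):
        G = fit_stump(X, y, w)                   # (a) weak classifier on weighted data
        pred = predict_stump(G, X)
        err = np.sum(w * (pred != y)) / np.sum(w)   # (b) weighted misclassification rate
        alpha = np.log((1 - err) / err)          # (c) voting weight
        w = w * np.exp(alpha * (pred != y))      # (d) up-weight misclassified observations
        w = w / np.sum(w)                        #     and renormalise
        stumps.append(G)
        alphas.append(alpha)
    return stumps, alphas

def predict_adaboost(stumps, alphas, X):
    F = sum(a * predict_stump(G, X) for a, G in zip(alphas, stumps))
    return np.sign(F)                            # step 3: weighted majority vote
```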

Second iteration: apply the classifier $G(\cdot)$ to the re-weighted observations ($w_i / \sum_i w_i$):

    w_i: 0.17 0.17 0.17 0.07 0.07 0.07 0.07 0.07 0.07 0.07

- observations 6, 7 and 9 are misclassified $\Rightarrow$ $\text{err}^{[2]} \approx 0.21$;
- compute $\alpha^{[2]} = 0.5 \log\big((1 - \text{err}^{[2]})/\text{err}^{[2]}\big) \approx 0.65$;
- set $w_i = w_i \exp\big\{\alpha^{[2]} 1\big(y_i \neq \hat{G}^{[2]}(x_i)\big)\big\}$:

    w_i: 0.09 0.09 0.09 0.04 0.04 0.14 0.14 0.04 0.14 0.04

Third iteration: apply the classifier $G(\cdot)$ to the re-weighted observations ($w_i / \sum_i w_i$):

    w_i: 0.11 0.11 0.11 0.05 0.05 0.17 0.17 0.05 0.17 0.05

- observations 4, 5 and 8 are misclassified $\Rightarrow$ $\text{err}^{[3]} \approx 0.14$;
- compute $\alpha^{[3]} = 0.5 \log\big((1 - \text{err}^{[3]})/\text{err}^{[3]}\big) \approx 0.92$;
- set $w_i = w_i \exp\big\{\alpha^{[3]} 1\big(y_i \neq \hat{G}^{[3]}(x_i)\big)\big\}$:

    w_i: 0.04 0.04 0.04 0.11 0.11 0.07 0.07 0.11 0.07 0.02
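The numbers above can be reproduced with a few lines of code. This sketch assumes the symmetric weight update $w_i \leftarrow w_i \exp\{-\alpha^{[m]} y_i \hat{G}^{[m]}(x_i)\}$ with $\alpha^{[m]} = \tfrac{1}{2}\log\big((1-\text{err}^{[m]})/\text{err}^{[m]}\big)$, which is the variant the displayed weights follow; after normalisation it yields the same weight sequence as the one-sided update in the algorithm box. The sets of misclassified observations are taken from the slides.

```python
import numpy as np

w = np.full(10, 0.10)                                # initial weights
misclassified = [[0, 1, 2], [5, 6, 8], [3, 4, 7]]    # iterations 1-3 (0-based indices)

for m, mis in enumerate(misclassified, start=1):
    w = w / w.sum()                                  # re-weighted observations used by G
    err = w[mis].sum()                               # weighted misclassification rate
    alpha = 0.5 * np.log((1 - err) / err)            # voting weight (half log-odds of the error)
    sign = np.where(np.isin(np.arange(10), mis), 1.0, -1.0)
    w = w * np.exp(alpha * sign)                     # up-weight errors, down-weight correct ones
    print(f"iteration {m}: err = {err:.2f}, alpha = {alpha:.2f}")
    print("  new weights:", np.round(w, 2))
```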

Statistical Boosting: Boosting as a forward stagewise additive modelling

The statistical view of boosting is based on the concept of forward stagewise additive modelling:
- minimize a loss function $L(y_i, f(x_i))$;
- using an additive model, $f(x) = \sum_{m=1}^M \beta_m b(x; \gamma_m)$ (see notes);
  - $b(x; \gamma_m)$ is the basis, or weak learner;
- at each step,
  $(\beta_m, \gamma_m) = \arg\min_{\beta, \gamma} \sum_{i=1}^N L\big(y_i, f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\big)$;
- the estimate is updated as $f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)$;
- e.g., in AdaBoost, $\beta_m = \alpha_m / 2$ and $b(x; \gamma_m) = G(x)$, as derived below.
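The link to AdaBoost becomes explicit when the loss is the exponential loss $L(y, f) = e^{-y f}$ and the basis is a classifier $G(x) \in \{-1, 1\}$; the following condensed derivation (the standard argument of Friedman et al., 2000; see also ESL, Section 10.4) shows why $\beta_m = \alpha_m / 2$. At step $m$,
$$(\beta_m, G_m) = \arg\min_{\beta, G} \sum_{i=1}^N e^{-y_i (f_{m-1}(x_i) + \beta G(x_i))} = \arg\min_{\beta, G} \sum_{i=1}^N w_i^{[m]} e^{-\beta\, y_i G(x_i)}, \qquad w_i^{[m]} = e^{-y_i f_{m-1}(x_i)}.$$
Since $y_i G(x_i) \in \{-1, 1\}$, the objective can be rewritten as
$$e^{-\beta} \sum_{i=1}^N w_i^{[m]} + \big(e^{\beta} - e^{-\beta}\big) \sum_{i=1}^N w_i^{[m]} 1\big(y_i \neq G(x_i)\big),$$
so for any $\beta > 0$ the optimal $G_m$ minimizes the weighted misclassification rate $\text{err}^{[m]}$; setting the derivative with respect to $\beta$ to zero then gives
$$\beta_m = \frac{1}{2} \log \frac{1 - \text{err}^{[m]}}{\text{err}^{[m]}} = \frac{\alpha^{[m]}}{2},$$
and the update $w_i^{[m+1]} = w_i^{[m]} e^{-\beta_m y_i G_m(x_i)}$ reproduces, up to normalization, the AdaBoost weight update.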

Statistical Boosting: Why exponential loss?

The statistical view of boosting:
- allows one to interpret the results;
- by studying the properties of the exponential loss.

It is easy to show that
$$f^*(x) = \arg\min_{f(x)} E_{Y|X=x}\big[e^{-Y f(x)}\big] = \frac{1}{2} \log \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = -1 \mid x)},$$
i.e.
$$\Pr(Y = 1 \mid x) = \frac{1}{1 + e^{-2 f^*(x)}},$$
therefore AdaBoost estimates one half of the log-odds of $\Pr(Y = 1 \mid x)$ (a short derivation is given below).

Note:
- the exponential loss is not the only possible loss function;
- deviance (cross-entropy): the binomial negative log-likelihood, based on
  $$l(\pi_x) = y' \log(\pi_x) + (1 - y') \log(1 - \pi_x),$$
  where:
  - $y' = (y + 1)/2$, i.e., $y' \in \{0, 1\}$;
  - $\pi_x = \Pr(Y = 1 \mid X = x) = \frac{e^{f(x)}}{e^{-f(x)} + e^{f(x)}} = \frac{1}{1 + e^{-2 f(x)}}$;
  - equivalently, $-l(\pi_x) = \log\big(1 + e^{-2 y f(x)}\big)$;
- $E[-l(\pi_x)]$ and $E[e^{-Y f(x)}]$ have the same population minimizer.

Statistical Boosting: Gradient boosting

We saw that AdaBoost iteratively minimizes a loss function. In general, consider:
- $L(f) = \sum_{i=1}^N L(y_i, f(x_i))$;
- $\hat{f} = \arg\min_f L(f)$;
- the minimization problem can be solved by considering
  $$f_{m_{\text{stop}}} = \sum_{m=0}^{m_{\text{stop}}} h_m,$$
  where:
  - $f_0 = h_0$ is the initial guess;
  - each $f_m$ improves the previous $f_{m-1}$ through $h_m$;
  - $h_m$ is called the step.
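Before moving on to steepest descent, the "easy to show" claim above deserves its one-line derivation. Pointwise in $x$, the conditional risk of the exponential loss is
$$E_{Y|X=x}\big[e^{-Y f(x)}\big] = \Pr(Y = 1 \mid x)\, e^{-f(x)} + \Pr(Y = -1 \mid x)\, e^{f(x)};$$
setting the derivative with respect to $f(x)$ to zero,
$$-\Pr(Y = 1 \mid x)\, e^{-f(x)} + \Pr(Y = -1 \mid x)\, e^{f(x)} = 0 \;\Rightarrow\; e^{2 f(x)} = \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = -1 \mid x)},$$
which gives the stated minimizer $f^*(x) = \frac{1}{2} \log \frac{\Pr(Y = 1 \mid x)}{\Pr(Y = -1 \mid x)}$ and, solving for the probability, $\Pr(Y = 1 \mid x) = 1/(1 + e^{-2 f^*(x)})$.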

Statistical Boosting: steepest descent

The steepest descent method chooses $h_m = -\rho_m g_m$, where:
- $g_m \in \mathbb{R}^N$ is the gradient of $L(f)$ evaluated at $f = f_{m-1}$ and represents the direction of the minimization,
  $$g_{im} = \left.\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right|_{f(x_i) = f_{m-1}(x_i)};$$
- $\rho_m$ is a scalar which tells how far to move along that direction,
  $$\rho_m = \arg\min_\rho L(f_{m-1} - \rho\, g_m).$$

Statistical Boosting: example

Consider the linear Gaussian regression case:
- $L(y, f(x)) = \frac{1}{2} \sum_{i=1}^N (y_i - f(x_i))^2$;
- $f(x) = X^T \beta$;
- initial guess: $\beta = 0$.

Therefore:
- $g = \frac{\partial\, \frac{1}{2} \sum_{i=1}^N (y_i - f(x_i))^2}{\partial f(x_i)} = -(y - X^T \beta)$;
- $g_m = -(y - X^T \beta)\big|_{\beta = 0} = -y$;
- $\rho_m = \arg\min_\rho \frac{1}{2} (y - \rho y)^2$, and the resulting update is the least-squares fit $X (X^T X)^{-1} X^T y$.

Note: overfitting!

Statistical Boosting: shrinkage

To regularize the procedure, a shrinkage factor $0 < \nu < 1$ is introduced,
$$f_m(x) = f_{m-1}(x) + \nu h_m.$$

Moreover, $h_m$ can be a general weak learner (base learner):
- stump;
- spline;
- ...

The idea is to fit the base learner to the negative gradient, in order to iteratively minimize the loss function.

Statistical Boosting: Gradient boosting

Gradient boosting algorithm:
1. initialize the estimate, e.g., $f_0(x) = 0$;
2. for $m = 1, \dots, m_{\text{stop}}$:
   2.1 compute the negative gradient vector, $u_m = -\left.\frac{\partial L(y, f(x))}{\partial f(x)}\right|_{f(x) = \hat{f}_{m-1}(x)}$;
   2.2 fit the base learner to the negative gradient vector, $h_m(u_m, x)$;
   2.3 update the estimate, $f_m(x) = f_{m-1}(x) + \nu h_m(u_m, x)$;
3. final estimate, $\hat{f}_{m_{\text{stop}}}(x) = \sum_{m=1}^{m_{\text{stop}}} \nu h_m(u_m, x)$.

Note:
- $u_m = -g_m$;
- $\hat{f}_{m_{\text{stop}}}(x)$ is a GAM.
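A minimal numpy sketch of the generic algorithm follows. The squared-error loss and the regression-stump base learner are illustrative choices of mine; the slides leave both the loss and the base learner generic.

```python
import numpy as np

def fit_regression_stump(X, u):
    """Least-squares regression stump fitted to the negative gradient vector u."""
    n, p = X.shape
    best = (np.inf, None)
    for j in range(p):
        for thr in np.unique(X[:, j]):
            left = X[:, j] <= thr
            if left.all() or (~left).all():
                continue
            c_left, c_right = u[left].mean(), u[~left].mean()
            rss = np.sum((u - np.where(left, c_left, c_right)) ** 2)
            if rss < best[0]:
                best = (rss, (j, thr, c_left, c_right))
    return best[1]

def predict_stump(stump, X):
    j, thr, c_left, c_right = stump
    return np.where(X[:, j] <= thr, c_left, c_right)

def gradient_boost(X, y, m_stop, nu=0.1):
    """Gradient boosting with L(y, f) = 1/2 (y - f)^2, so u_m = y - f_{m-1}(x)."""
    f = np.zeros(len(y))                      # step 1: f_0(x) = 0
    learners = []
    for m in range(m_stop):
        u = y - f                             # 2.1 negative gradient (here: residuals)
        h = fit_regression_stump(X, u)        # 2.2 base learner fitted to u_m
        f = f + nu * predict_stump(h, X)      # 2.3 shrunken update
        learners.append(h)
    return learners                           # f_hat(x) = sum_m nu * h_m(u_m, x)

def predict_boost(learners, X, nu=0.1):
    return sum(nu * predict_stump(h, X) for h in learners)
```

With the squared-error loss the negative gradient $u_m$ is simply the vector of current residuals $y - f_{m-1}(x)$, which is why this case is often described as repeatedly fitting the residuals.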

Statistical Boosting: example

Consider again the linear Gaussian regression case:
- $L(y, f(X)) = \frac{1}{2} \sum_{i=1}^N (y_i - f(x_i, \beta))^2$, with $f(x_i, \beta) = x_i \beta$;
- $h(y, X) = X (X^T X)^{-1} X^T y$.

Therefore:
- initialize the estimate, e.g., $\hat{f}_0(X, \beta) = 0$;
- $m = 1$:
  - $u_1 = -\left.\frac{\partial L(y, f(X, \beta))}{\partial f(X, \beta)}\right|_{f(X, \beta) = \hat{f}_0(X, \beta)} = (y - 0) = y$;
  - $h_1(u_1, X) = X (X^T X)^{-1} X^T y$;
  - $\hat{f}_1(X) = 0 + \nu X (X^T X)^{-1} X^T y$;
- $m = 2$:
  - $u_2 = -\left.\frac{\partial L(y, f(X, \beta))}{\partial f(X, \beta)}\right|_{f(X, \beta) = \hat{f}_1(X, \beta)} = \big(y - X^T (\nu \hat{\beta})\big)$;
  - $h_2(u_2, X) = X (X^T X)^{-1} X^T \big(y - X^T (\nu \hat{\beta})\big)$;
  - update the estimate,
    $\hat{f}_2(X, \beta) = \nu X (X^T X)^{-1} X^T y + \nu X (X^T X)^{-1} X^T \big(y - X^T (\nu \hat{\beta})\big)$.

Statistical Boosting: remarks

Note that:
- the effects do not need to be linear: $h(y, X)$ can be, e.g., a spline;
- using $f(X, \beta) = X^T \beta$, it makes more sense to work directly with $\beta$:
  1. initialize the estimate, e.g., $\hat{\beta}_0 = 0$;
  2. for $m = 1, \dots, m_{\text{stop}}$:
     2.1 compute the negative gradient vector, $u_m = -\left.\frac{\partial L(y, f(X, \beta))}{\partial f(X, \beta)}\right|_{\beta = \hat{\beta}_{m-1}}$;
     2.2 fit the base learner to the negative gradient vector, $b_m(u_m, X) = (X^T X)^{-1} X^T u_m$;
     2.3 update the estimate, $\hat{\beta}_m = \hat{\beta}_{m-1} + \nu b_m(u_m, X)$;
  3. final estimate, $\hat{f}_{m_{\text{stop}}}(X) = X^T \hat{\beta}_{m_{\text{stop}}}$.

Statistical Boosting: remarks

Further remarks:
- for $m_{\text{stop}} \to \infty$, $\hat{\beta}_{m_{\text{stop}}} \to \hat{\beta}_{\text{OLS}}$ (see the sketch after the references);
- the shrinkage is controlled by both $m_{\text{stop}}$ and $\nu$:
  - usually $\nu$ is fixed, $\nu = 0.1$;
  - $m_{\text{stop}}$ is computed by cross-validation;
- $m_{\text{stop}}$ controls the model complexity:
  - we need an early stop to avoid overfitting;
  - if it is too small → too much bias;
  - if it is too large → too much variance;
- the predictors must be centred, $E[X_j] = 0$.

References I

Bühlmann, P. (2006). Boosting for high-dimensional linear models. The Annals of Statistics 34, 559-583.

Friedman, J., Hastie, T. & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. The Annals of Statistics 28, 337-407.

Mayr, A., Binder, H., Gefeller, O. & Schmid, M. (2014). Extending statistical boosting. Methods of Information in Medicine 53, 428-435.

Schapire, R. E. & Freund, Y. (2014). Boosting: Foundations and Algorithms. MIT Press, Cambridge.
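As a closing illustration of the remarks above, here is a minimal simulation sketch of the $\beta$-updating algorithm for the linear Gaussian case, showing numerically that $\hat{\beta}_{m_{\text{stop}}}$ approaches $\hat{\beta}_{\text{OLS}}$ as $m_{\text{stop}}$ grows. The simulated data, the dimensions and the seed are arbitrary assumptions; the code uses the usual $N \times p$ design-matrix convention.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, nu, m_stop = 200, 5, 0.1, 500

X = rng.standard_normal((N, p))
X = X - X.mean(axis=0)                     # predictors centred, E[X_j] = 0
beta_true = rng.standard_normal(p)
y = X @ beta_true + rng.standard_normal(N)

beta = np.zeros(p)                         # 1. initialize beta_0 = 0
for m in range(m_stop):
    u = y - X @ beta                       # 2.1 negative gradient at beta_{m-1}
    b = np.linalg.solve(X.T @ X, X.T @ u)  # 2.2 base learner: least-squares fit to u_m
    beta = beta + nu * b                   # 2.3 shrunken update

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta, 3))                   # boosted coefficients
print(np.round(beta_ols, 3))               # OLS; the two coincide as m_stop grows
```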