Importance Sampling: An Alternative View of Ensemble Learning
Jerome H. Friedman and Bogdan Popescu, Stanford University


PREDICTIVE LEARNING

Given data: $\{z_i\}_1^N = \{y_i, x_i\}_1^N \sim q(z)$

$y$ = output or response attribute (variable)
$x = \{x_1, \ldots, x_n\}$ = inputs or predictors

and loss function $L(y, F)$: estimate

$F^*(x) = \arg\min_{F(x)} E_{q(z)}\, L(y, F(x))$

WHY?

$F^*(x)$ is the best predictor of $y \mid x$ under $L$.

Examples:
Regression: $y, F \in R$, $L(y, F) = |y - F|$ or $(y - F)^2$
Classification: $y, F \in \{c_1, \ldots, c_K\}$, $L(y, F) = L_{y,F}$ ($K \times K$ matrix)

$F^*(x)$ = target function (regression) or concept (classification)

Estimate: $\hat{F}(x) \leftarrow$ learning procedure$(\{z_i\}_1^N)$

Here: procedure = LEARNING ENSEMBLES

BASIC LINEAR MODEL

$F(x) = \int_P a(p)\, f(x; p)\, dp$

$f(x; p)$ = base learner (basis function)
parameters: $p = (p_1, p_2, \ldots)$
$p \in P$ indexes a particular function of $x$ from $\{f(x; p)\}_{p \in P}$
$a(p)$ = coefficient of $f(x; p)$

Examples:
$f(x; p) = [1 + \exp(-p^t x)]^{-1}$ (neural nets)
         = multivariate splines (MARS)
         = decision trees (MART, RF)

NUMERICAL QUADRATURE

$\int_P I(p)\, dp \simeq \sum_{m=1}^M w_m\, I(p_m)$

here: $I(p) = a(p)\, f(x; p)$

Quadrature rule defined by:
$\{p_m\}_1^M$ = evaluation points $\in P$
$\{w_m\}_1^M$ = weights

$F(x) \simeq \sum_{m=1}^M w_m\, a(p_m)\, f(x; p_m) = \sum_{m=1}^M c_m\, f(x; p_m)$

Averaging over $x$: $\{c_m^*\}_1^M$ = linear regression of $y$ on $\{f(x; p_m)\}_1^M$ (pop.)

Problem: find good $\{p_m\}_1^M$.

MONTE CARLO METHODS

$r(p)$ = sampling pdf of $p \in P$
$\{p_m \sim r(p)\}_1^M$

Simple Monte Carlo: $r(p)$ = constant
Usually not very good

IMPORTANCE SAMPLING

Customize $r(p)$ for each particular problem ($F^*(x)$)

$r(p_m)$ big $\Longleftrightarrow$ $p_m$ important to high accuracy when used with $\{p_{m'}\}_{m' \ne m}$

MONTE CARLO METHODS

(1) Random Monte Carlo: ignore other points: $p_m \sim r(p)$ iid
(2) Quasi Monte Carlo: $\{p_m\}_1^M$ = deterministic;
    account for other points: importance of groups of points

RANDOM MONTE CARLO

(Lack of) importance $J(p)$ depends only on $p$

One measure: partial importance
$J(p) = E_{q(z)}\, L(y, f(x; p))$

$p^* = \arg\min_p J(p)$ = best single point ($M = 1$) rule
$f(x; p^*)$ = optimal single base learner

Usually not very good, especially if $F^*(x) \notin \{f(x; p)\}_{p \in P}$

BUT, often used: single logistic regression or tree

Note: $J(p_m)$ ignores $\{p_{m'}\}_{m' \ne m}$
Hope: better than $r(p)$ = constant.

PARTIAL IMPORTANCE SAMPLING

$r(p) = g(J(p))$
$g(\cdot)$ = monotone decreasing function

$r(p^*) = \max$ $\Rightarrow$ center (location) at $p^*$
$p \ne p^* \Rightarrow r(p) < r(p^*)$
$d(p, p^*) = J(p) - J(p^*)$

Besides location, the critical parameter for importance sampling is the scale (width) of $r(p)$:

$\sigma = \int_P d(p, p^*)\, r(p)\, dp$

Controlled by the choice of $g(\cdot)$:
$\sigma$ too large $\Rightarrow$ $r(p)$ = constant
$\sigma$ too small $\Rightarrow$ best single point rule $p^*$


Questions:
(1) how to choose $g(\cdot)$ $\Rightarrow$ $\sigma$
(2) how to sample from $r(p) = g(J(p))$

TRICK: Perturbation sampling

Repeatedly:
(1) randomly modify (perturb) the problem
(2) find the optimal $f(x; p_m)$ for the perturbed problem

$p_m = R_m\{\arg\min_p E_{q(z)}\, L(y, f(x; p))\}$

Control the width $\sigma$ of $r(p)$ by the degree of perturbation.
Perturb: $L(y, F)$, $q(z)$, algorithm, or a hybrid.
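To make the perturbation-sampling loop concrete, here is a minimal Python sketch (not the authors' code), assuming scikit-learn's DecisionTreeRegressor as the base learner $f(x; p)$ and data subsampling as the perturbation of $\hat{q}(z)$; the subsample fraction acts as the width control.

```python
# A minimal sketch of perturbation sampling: each pass randomly perturbs
# the problem (here: subsampling the data) and solves for the best single
# base learner on the perturbed problem.
import numpy as np
from sklearn.tree import DecisionTreeRegressor  # assumed base learner f(x; p)

def perturbation_sample(X, y, M=100, frac=0.5, max_depth=6, rng=None):
    """Return M base learners, one per perturbed problem."""
    rng = np.random.default_rng(rng)
    N = len(y)
    learners = []
    for m in range(M):
        # (1) perturb: draw K = frac*N observations without replacement
        idx = rng.choice(N, size=int(frac * N), replace=False)
        # (2) optimize: fit f(x; p_m) on the perturbed problem
        f_m = DecisionTreeRegressor(max_depth=max_depth).fit(X[idx], y[idx])
        learners.append(f_m)
    return learners
```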

EXAMPLES

Perturb the loss function:
$L_m(y, f) = L(y, f) + \eta\, l_m(y, f)$,  $l_m(y, f)$ = random function
or
$L_m(y, f) = L(y, f + \eta\, h_m(x))$,  $h_m(x)$ = random function of $x$

$p_m = \arg\min_p E_{q(z)}\, L_m(y, f(x; p))$

Width $\sigma$ of $r(p)$ $\leftrightarrow$ value of $\eta$

Perturb the sampling distribution:
$q_m(z) = [w_m(z)]^\eta\, q(z)$,  $w_m(z)$ = random function of $z$

$p_m = \arg\min_p E_{q_m(z)}\, L(y, f(x; p))$

Width $\sigma$ of $r(p)$ $\leftrightarrow$ value of $\eta$

Perturb the algorithm:
$p_m = \mathrm{rand}[\arg\min_p]\, E_{q(z)}\, L(y, f(x; p))$

Control the width $\sigma$ of $r(p)$ by the degree:
repeated partial optimizations, perturb partial solutions

Examples: Dietterich: random trees; Breiman: random forests

GOAL

Produce a good $\{p_m\}_1^M$ so that

$\sum_{m=1}^M c_m^*\, f(x; p_m) \simeq F^*(x)$

where $\{c_m^*\}_1^M$ = population linear regression (under $L$) of $y$ on $\{f(x; p_m)\}_1^M$

Note: both depend on knowing the population $q(z)$.

FINITE DATA

$\{z_i\}_1^N \sim q(z)$
$\hat{q}(z) = \sum_{i=1}^N \frac{1}{N}\, \delta(z - z_i)$

Apply perturbation sampling based on $\hat{q}(z)$:
Loss function / algorithm: $q(z) \to \hat{q}(z)$, width $\sigma$ of $r(p)$ controlled as before

Sampling distribution: random reweighting

$\hat{q}_m(z) = \sum_{i=1}^N w_{im}\, \delta(z - z_i)$
$w_{im} \sim \Pr(w)$ with $E\, w_{im} = 1/N$

Width $\sigma$ of $r(p)$ controlled by $\mathrm{std}(w_{im})$

Fastest computation: $w_{im} \in \{0, 1/K\}$, draw $K$ from $N$ without replacement
$\sigma \sim \mathrm{std}(w) = (N/K - 1)^{1/2}/N$;  computation $\sim K/N$
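As a quick check of the width formula: each weight equals $1/K$ with probability $K/N$ and $0$ otherwise, so

```latex
% w = 1/K with probability K/N, and w = 0 otherwise.
\begin{align*}
E\,w &= \tfrac{1}{K}\cdot\tfrac{K}{N} = \tfrac{1}{N},\\
\operatorname{Var}(w) &= E\,w^2 - (E\,w)^2
  = \tfrac{1}{K^2}\cdot\tfrac{K}{N} - \tfrac{1}{N^2}
  = \tfrac{N-K}{K N^2},\\
\operatorname{std}(w) &= \frac{(N/K - 1)^{1/2}}{N}.
\end{align*}
```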

Quadrature Coefficients

Population: linear regression of $y$ on $\{f(x; p_m)\}_1^M$:

$\{c_m^*\}_1^M = \arg\min_{\{c_m\}} E_{q(z)}\, L\big(y,\ \textstyle\sum_{m=1}^M c_m f(x; p_m)\big)$

Finite data: regularized linear regression

$\{\hat{c}_m\}_1^M = \arg\min_{\{c_m\}} E_{\hat{q}(z)}\, L\big(y,\ \textstyle\sum_{m=1}^M c_m f(x; p_m)\big) + \lambda \sum_{m=1}^M |c_m - c_m^{(0)}|$  (lasso)

Regularization $\Rightarrow$ reduce variance
$\{c_m^{(0)}\}_1^M$ = prior guess (usually = 0)
$\lambda > 0$ chosen by cross-validation
Fast algorithm: solutions for all $\lambda$ (see HTF 2001, EHT 2002)
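A minimal Python sketch of this coefficient step, assuming squared-error loss and using scikit-learn's LassoCV in place of the dedicated all-$\lambda$ path algorithm; the function names are illustrative.

```python
# Regularized linear regression of y on the base learner predictions,
# shrinking toward the prior guess c^(0) = 0; lambda is picked by CV.
import numpy as np
from sklearn.linear_model import LassoCV

def fit_quadrature_coefficients(learners, X, y, cv=5):
    """Lasso fit of {c_m} on the columns f(x; p_m)."""
    # Column m holds f(x; p_m) evaluated on the training data.
    F = np.column_stack([f.predict(X) for f in learners])
    post = LassoCV(cv=cv).fit(F, y)
    return post  # post.coef_ are the {c_m}, post.intercept_ the offset
```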

Importance Sampled Learning Ensembles (ISLE)

Numerical integration:

$F(x) = \int_P a(p)\, f(x; p)\, dp \simeq \sum_{m=1}^M \hat{c}_m\, f(x; \hat{p}_m)$

$\{\hat{p}_m\}_1^M \sim \hat{r}(p)$: importance sampling via perturbation sampling on $\hat{q}(z)$
$\{\hat{c}_m\}_1^M$: regularized linear regression of $y$ on $\{f(x; \hat{p}_m)\}_1^M$

BAGGING (Breiman 1996)

Perturb the data distribution $\hat{q}(z)$:

$\hat{q}_m(z)$ = bootstrap sample $= \sum_{i=1}^N w_{im}\, \delta(z - z_i)$
$w_{im} \in \{0, 1/N, 2/N, \ldots, 1\}$ $\sim$ multinomial$(1/N)$

$p_m = \arg\min_p E_{\hat{q}_m(z)}\, L(y, f(x; p))$

$\hat{F}(x) = \frac{1}{M} \sum_{m=1}^M f(x; p_m)$  (average)
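For reference, a minimal Python sketch of bagging in this notation, assuming a scikit-learn decision tree as the base learner; the names are illustrative.

```python
# Bagging: bootstrap perturbation of q-hat(z), equal coefficients 1/M.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_ensemble(X, y, M=100, rng=None):
    rng = np.random.default_rng(rng)
    N = len(y)
    trees = []
    for m in range(M):
        # bootstrap sample: multinomial weights w_im in {0, 1/N, 2/N, ..., 1}
        idx = rng.choice(N, size=N, replace=True)
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    # fixed coefficients c_m = 1/M: a simple average, no joint fitting
    return np.mean([t.predict(X) for t in trees], axis=0)
```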

Width $\sigma$ of $r(p)$: $E(\mathrm{std}(w_{im})) = (1 - 1/N)^{1/2}/N \simeq 1/N$
Fixed $\Rightarrow$ no control

No joint fitting of coefficients: $\lambda = \infty$ and $c_m^{(0)} = 1/M$

Potential improvements:
Different $\sigma$ (sampling strategy)
$\lambda < \infty$: jointly fit the coefficients to the data

RANDOM FORESTS (Breiman 1998)

$f(x; p) = T(x)$ = largest possible decision tree

Hybrid sampling strategy:
(1) $\hat{q}_m(z)$ = bootstrap sample (bagging)
(2) random algorithm modification: select the variable for each split from a randomly chosen subset of size $n_s$
Breiman: $n_s = \mathrm{floor}(\log_2 n + 1)$

$\hat{F}(x) = \frac{1}{M} \sum_{m=1}^M T_m(x)$  (average)

As an ISLE: $\sigma(\mathrm{RF}) > \sigma(\mathrm{Bag})$ (increasing as $n_s$ decreases)

Potential improvements: same as bagging
Different $\sigma$ (sampling strategy)
$\lambda < \infty$: jointly fit the coefficients to the data (more later)
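A minimal Python sketch of the "jointly fit coefficients" improvement applied to a random forest, using scikit-learn's RandomForestRegressor and LassoCV; the function names are illustrative, not the authors' implementation.

```python
# Keep the individual trees of a random forest, then replace the 1/M
# average by lasso-fitted coefficients on the tree predictions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV

def rf_isle(X_train, y_train, M=200, n_s="log2", cv=5):
    rf = RandomForestRegressor(
        n_estimators=M,
        max_features=n_s,   # size of the random split-variable subset
        bootstrap=True,     # hybrid: bootstrap sampling + random splits
    ).fit(X_train, y_train)
    # Evaluate each tree T_m(x) on the training data ...
    T = np.column_stack([t.predict(X_train) for t in rf.estimators_])
    # ... and jointly fit the coefficients with a lasso.
    post = LassoCV(cv=cv).fit(T, y_train)
    return rf, post

def rf_isle_predict(rf, post, X):
    T = np.column_stack([t.predict(X) for t in rf.estimators_])
    return post.predict(T)
```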

SEQUENTIAL SAMPLING

Random Monte Carlo: $\{p_m \sim r(p)\}_1^M$ iid
Quasi Monte Carlo: $\{p_m\}_1^M$ = deterministic

$J(\{p_m\}_1^M) = \min_{\{\alpha_m\}} E_{q(z)}\, L\big(y,\ \textstyle\sum_{m=1}^M \alpha_m f(x; p_m)\big)$

Joint regression of $y$ on $\{f(x; p_m)\}_1^M$ (pop.)

Approximation: sequential sampling (forward stagewise)

$J_m(p \mid \{p_l\}_1^{m-1}) = \min_\alpha E_{q(z)}\, L(y,\ \alpha f(x; p) + h_m(x))$
$h_m(x) = \sum_{l=1}^{m-1} \alpha_l\, f(x; p_l)$,  $\alpha_l$ = solution for $p_l$

$p_m = \arg\min_p J_m(p \mid \{p_l\}_1^{m-1})$

Repeatedly modify the loss function: similar to $L_m(y, f) = L(y, f + \eta\, h_m(x))$, but here $\eta = 1$ and $h_m(x)$ is deterministic
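A minimal Python sketch of this forward stagewise scheme under squared-error loss, assuming a shallow scikit-learn tree as the base learner; the shrinkage factor eta anticipates the MART variant on the next slide.

```python
# Sequential sampling: choose each p_m given the partial solution h_m(x).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def forward_stagewise(X, y, M=100, max_depth=3, eta=1.0):
    h = np.zeros(len(y))          # h_m(x), the partial solution so far
    stages = []
    for m in range(M):
        # Fit f(x; p_m) to the current residual y - h_m(x).
        f_m = DecisionTreeRegressor(max_depth=max_depth).fit(X, y - h)
        # With squared error the tree's leaf means already solve min over alpha,
        # so alpha_m = 1; eta < 1 gives MART-style shrinkage instead.
        alpha_m = eta
        h += alpha_m * f_m.predict(X)
        stages.append((alpha_m, f_m))
    return stages

def stagewise_predict(stages, X):
    return sum(alpha * f.predict(X) for alpha, f in stages)
```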

Connection to Boosting

AdaBoost (Freund & Schapire 1996):
$L(y, f) = \exp(-y f)$,  $y \in \{-1, 1\}$
$\hat{F}(x) = \mathrm{sign}\big(\sum_{m=1}^M \alpha_m f(x; p_m)\big)$
$\{\alpha_m\}_1^M$ = sequential partial regression coefficients

Gradient Boosting (MART, Friedman 2001):
general $y$ and $L(y, f)$, $\alpha_m$ = shrunk ($\eta \ll 1$)
$\hat{F}(x) = \sum_{m=1}^M \alpha_m f(x; p_m)$

Potential improvements (ISLE):

(1) $\hat{F}(x) = \sum_{m=1}^M \hat{c}_m\, f(x; \hat{p}_m)$
    $\{\hat{p}_m\}_1^M$: sequential sampling on $\hat{q}(z)$
    $\{\hat{c}_m\}_1^M$: regularized linear regression

(2) and/or hybrid with random $\hat{q}_m(z)$ (speed): sample $K$ from $N$ without replacement

ISLE Paradigm

Wide variety of ISLE methods:
(1) base learner $f(x; p)$
(2) loss criterion $L(y, f)$
(3) perturbation method
(4) degree of perturbation: $\sigma$ of $r(p)$
(5) iid vs. sequential
(6) hybrids

Examine several options.

Monte Carlo Study

100 data sets: each $N = 10000$, $n = 40$
$\{y_{il} = F_l(x_i) + \varepsilon_{il}\}_{i=1}^{10000}$,  $l = 1, \ldots, 100$
$\{F_l(x)\}_1^{100}$ = different (random) target functions
$x_i \sim N(0, I_{40})$,  $\varepsilon_{il} \sim N(0, \mathrm{Var}_x(F_l(x)))$
signal/noise = 1/1

Evaluation Criteria

Relative RMS error: $\mathrm{rmse}(\hat{F}_{jl}) = [1 - R^2(F_l, \hat{F}_{jl})]^{1/2}$
Comparative RMS error: $\mathrm{cmse} = \mathrm{rmse}(\hat{F}_{jl}) / \min_k \{\mathrm{rmse}(\hat{F}_{kl})\}$  (adjusts for problem difficulty)
$j, k \in$ {respective methods};  10000 independent observations
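A small Python sketch of these two criteria, assuming arrays of target values and per-method predictions on the held-out observations; the names are illustrative.

```python
import numpy as np

def relative_rmse(F_true, F_hat):
    # rmse = [1 - R^2(F, F_hat)]^(1/2)
    ss_res = np.sum((F_true - F_hat) ** 2)
    ss_tot = np.sum((F_true - F_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return np.sqrt(1.0 - r2)

def comparative_rmse(F_true, predictions_by_method):
    # cmse_j = rmse_j / min_k rmse_k, so the best method scores 1.0
    rmses = {j: relative_rmse(F_true, F_hat)
             for j, F_hat in predictions_by_method.items()}
    best = min(rmses.values())
    return {j: v / best for j, v in rmses.items()}
```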

Properties of $F_l(x)$:
(1) 30 noise variables
(2) wide variety of functions (difficulty)
(3) emphasize lower-order interactions
(4) not in the span of the base learners

Decision Trees

[Figure: Bagging Relative RMS Error. Relative RMS error (vertical axis, approximately 0.3 to 0.9) for Bag, Bag_P, Bag_.05, Bag_.05_P, Bag_6, Bag_6_P, and Bag_6_.05_P.]

[Figure: Bagging Comparative RMS Error. Comparative RMS error (vertical axis, approximately 1.0 to 2.0) for Bag, Bag_P, Bag_.05, Bag_.05_P, Bag_6, Bag_6_P, and Bag_6_.05_P.]

[Figure: Random Forests Comparative RMS Error. Comparative RMS error (vertical axis, approximately 1.0 to 2.5) for RF, RF_P, RF_.05, RF_.05_P, RF_6, RF_6_P, and RF_6_.05_P.]

[Figure: Bag/RF Comparative RMS Error. Comparative RMS error (vertical axis, approximately 1.0 to 1.6) for Bag, RF, Bag_6_05_P, and RF_6_05_P.]

[Figure: Sequential Sampling Comparative RMS Error. Comparative RMS error (vertical axis, approximately 1.0 to 1.3) for Mart, Mart_P, Mart_10_P, Mart_.01_10_P, and Mart_.01_20_P.]

[Figure: Seq/Bag/RF Comparative RMS Error. Comparative RMS error (vertical axis, approximately 1.0 to 2.2) for Mart, Mart_P, Seq_.01_.2, Bag, Bag_6_.05_P, RF, and RF_6_.05_P.]

SUMMARY

Theory: unify (1) bagging, (2) random forests, (3) Bayesian model averaging, and (4) boosting $\Rightarrow$ single paradigm: Monte Carlo integration

(1)-(3): iid Monte Carlo, $p \sim r(p)$
  (1), (2): perturbation sampling; (3): MCMC
(4): quasi Monte Carlo: approximate sequential sampling

Practice: fitting the ensemble coefficients $\{\hat{c}_m\}_1^M$ by lasso linear regression

(1) improves the accuracy of RF and bagging (and is faster)
(2) combined with aggressive subsampling and weaker base learners, improves speed: bagging and RF by more than $10^2$, MART by about 5, allowing much bigger data sets. Also, prediction is many times faster.

FUTURE DIRECTIONS

(1) More thorough understanding (theory) $\Rightarrow$ specific recommendations

(2) Multiple learning ensembles (MISLEs)

$F(x) = \sum_{k=1}^K \int_{P_k} a_k(p_k)\, f_k(x; p_k)\, dp_k$

$\{f_k(x; p_k)\}_1^K$ = different (comp.) base learners

$\hat{F}(x) = \sum_{k,m} c_{km}\, f_k(x; \hat{p}_{km})$,  $\{c_{km}\}$ = combined lasso regression

Example: $f_1$ = decision trees, $f_2 = \{x_j\}_1^n$ (no sampling)
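A minimal Python sketch of this example MISLE (not the authors' code): one tree family built by perturbation sampling, the raw inputs as a second family, and a single combined lasso over both sets of basis functions.

```python
# MISLE example: f_1 = decision trees from subsampled data,
# f_2 = the raw inputs {x_j}, with one combined lasso over both.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LassoCV

def misle_fit(X, y, M=200, frac=0.5, max_depth=4, rng=None):
    rng = np.random.default_rng(rng)
    N, n = X.shape
    trees = []
    for m in range(M):
        idx = rng.choice(N, size=int(frac * N), replace=False)
        trees.append(DecisionTreeRegressor(max_depth=max_depth).fit(X[idx], y[idx]))
    # Columns: tree predictions f_1(x; p_m), then the raw inputs x_1, ..., x_n.
    basis = np.column_stack([t.predict(X) for t in trees] + [X])
    post = LassoCV(cv=5).fit(basis, y)
    return trees, post

def misle_predict(trees, post, X):
    basis = np.column_stack([t.predict(X) for t in trees] + [X])
    return post.predict(basis)
```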

SLIDES

http://www-stat.stanford.edu/~jhf/isletalk.pdf

REFERENCES

HTF (2001): Hastie, Tibshirani, and Friedman, Elements of Statistical Learning, Springer.
EHT (2002): Efron, Hastie, and Tibshirani, Least Angle Regression (Stanford preprint).