Importance Sampling: An Alternative View of Ensemble Learning
Jerome H. Friedman, Bogdan Popescu
Stanford University
PREDICTIVE LEARNING

Given data: $\{z_i\}_1^N = \{y_i, x_i\}_1^N \sim q(z)$
  $y$ = output or response attribute (variable)
  $x = \{x_1, \ldots, x_n\}$ = inputs or predictors
and a loss function $L(y, F)$: estimate
  $F^*(x) = \arg\min_{F(x)} E_{q(z)}\, L(y, F(x))$
WHY?

$F^*(x)$ is the best predictor of $y \mid x$ under $L$.

Examples:
  Regression: $y, F \in \mathbb{R}$, $L(y, F) = |y - F|$ or $(y - F)^2$
  Classification: $y, F \in \{c_1, \ldots, c_K\}$, $L(y, F) = L_{y,F}$ ($K \times K$ matrix)
$F^*(x)$ = target function (regression), concept (classification)

Estimate: $\hat F(x) = \text{learning procedure}\,(\{z_i\}_1^N)$

Here: procedure = LEARNING ENSEMBLES
BASIC LINEAR MODEL

$F(x) = \int_{\mathcal{P}} a(p)\, f(x; p)\, dp$

$f(x; p)$ = base learner (basis function)
  parameters $p = (p_1, p_2, \ldots)$, $p \in \mathcal{P}$, index a particular function of $x$ from $\{f(x; p)\}_{p \in \mathcal{P}}$
$a(p)$ = coefficient of $f(x; p)$
Examples:
  $f(x; p) = [1 + \exp(-p^T x)]^{-1}$ (neural nets)
  $f(x; p)$ = multivariate splines (MARS)
  $f(x; p)$ = decision trees (MART, RF)
NUMERICAL QUADRATURE

$\int_{\mathcal{P}} I(p)\, dp \simeq \sum_{m=1}^M w_m\, I(p_m)$,  here $I(p) = a(p)\, f(x; p)$

Quadrature rule defined by:
  $\{p_m\}_1^M$ = evaluation points $\in \mathcal{P}$
  $\{w_m\}_1^M$ = weights
$F(x) \simeq \sum_{m=1}^M w_m\, a(p_m)\, f(x; p_m) = \sum_{m=1}^M c_m\, f(x; p_m)$

Averaging over $x$: $\{c_m^*\}_1^M$ = linear regression of $y$ on $\{f(x; p_m)\}_1^M$ (pop.)

Problem: find good $\{p_m\}_1^M$.
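To make the finite-$M$ approximation concrete, here is a minimal sketch in Python with scikit-learn (an illustration, not from the talk): the synthetic data, the subsampled small trees standing in for $\{f(x; p_m)\}$, and all variable names are assumptions; the coefficients $c_m$ come from an ordinary linear regression of $y$ on the base-learner outputs.

    # Minimal sketch: F(x) ~= sum_m c_m f(x; p_m), with c_m fit by linear
    # regression of y on the base-learner outputs (finite-sample analogue).
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X[:, 0] * X[:, 1] + np.sin(X[:, 2]) + rng.normal(scale=0.5, size=1000)

    # {f(x; p_m)}: small trees fit to random subsamples, one simple way to
    # obtain a diverse set of evaluation points p_m.
    M = 50
    learners = []
    for m in range(M):
        idx = rng.choice(len(X), size=200, replace=False)
        learners.append(DecisionTreeRegressor(max_depth=3).fit(X[idx], y[idx]))

    # Design matrix with one column per f(x; p_m), then {c_m} by regression.
    F = np.column_stack([f.predict(X) for f in learners])
    lin = LinearRegression().fit(F, y)
    print("in-sample R^2 of the ensemble:", lin.score(F, y))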
MONTE CARLO METHODS

$r(p)$ = sampling pdf of $p \in \mathcal{P}$,  $\{p_m \sim r(p)\}_1^M$

Simple Monte Carlo: $r(p)$ = constant. Usually not very good.
IMPORTANCE SAMPLING

Customize $r(p)$ for each particular problem ($F^*(x)$):
  $r(p_m)$ big $\iff$ $p_m$ important to high accuracy when used with $\{p_{m'}\}_{m' \neq m}$
MONTE CARLO METHODS

(1) Random Monte Carlo: ignore other points: $p_m \sim r(p)$ iid
(2) Quasi Monte Carlo: $\{p_m\}_1^M$ = deterministic; account for other points; importance $\leftarrow$ groups of points
RANDOM MONTE CARLO

(Lack of) importance $J(p)$ depends only on $p$. One measure: partial importance
  $J(p) = E_{q(z)}\, L(y, f(x; p))$
  $p^* = \arg\min_p J(p)$ = best single point ($M = 1$) rule
  $f(x; p^*)$ = optimal single base learner
Usually not very good, especially if $F^*(x) \notin \{f(x; p)\}_{p \in \mathcal{P}}$.
BUT, often used: single logistic regression or tree.
Note: $J(p_m)$ ignores $\{p_{m'}\}_{m' \neq m}$.
Hope: better than $r(p)$ = constant.
PARTIAL IMPORTANCE SAMPLING

$r(p) = g(J(p))$,  $g(\cdot)$ = monotone decreasing function
$r(p^*)$ = max $\Rightarrow$ center (location) $p^*$
$p \neq p^* \Rightarrow r(p) < r(p^*)$
$d(p, p^*) = J(p) - J(p^*)$
Besides location, the critical parameter for importance sampling is the scale (width) of $r(p)$:
  $\sigma = \int_{\mathcal{P}} d(p, p^*)\, r(p)\, dp$
Controlled by the choice of $g(\cdot)$:
  $\sigma$ too large $\Rightarrow$ $r(p) \simeq$ constant
  $\sigma$ too small $\Rightarrow$ best single point rule $p^*$
Questions:
(1) how to choose $g(\cdot)$ $\Rightarrow$ $\sigma$
(2) how to sample from $r(p) = g(J(p))$
TRICK

Perturbation sampling — repeatedly:
(1) randomly modify (perturb) the problem
(2) find the optimal $f(x; p_m)$ for the perturbed problem:
  $p_m = R_m\{\arg\min_p E_{q(z)}\, L(y, f(x; p))\}$
Control the width $\sigma$ of $r(p)$ by the degree of perturbation.
Perturb: $L(y, F)$, $q(z)$, the algorithm, or a hybrid.
EXAMPLES

Perturb the loss function:
  $L_m(y, f) = L(y, f) + \eta\, l_m(y, f)$,  $l_m(y, f)$ = random function
or
  $L_m(y, f) = L(y, f + \eta\, h_m(x))$,  $h_m(x)$ = random function of $x$
  $p_m = \arg\min_p E_{q(z)}\, L_m(y, f(x; p))$
Width $\sigma$ of $r(p)$ $\leftarrow$ value of $\eta$
Perturb the sampling distribution:
  $q_m(z) = [w_m(z)]^\eta\, q(z)$,  $w_m(z)$ = random function of $z$
  $p_m = \arg\min_p E_{q_m(z)}\, L(y, f(x; p))$
Width $\sigma$ of $r(p)$ $\leftarrow$ value of $\eta$
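A minimal sketch of this reweighting in Python/scikit-learn (an assumption, not the authors' code): $w_m(z_i)$ is drawn from an exponential distribution and raised to the power $\eta$, so $\eta = 0$ recovers equal weights and larger $\eta$ widens $r(p)$; the data and the tree depth are hypothetical.

    # Perturb the sampling distribution: q_m(z) = [w_m(z)]^eta * q(z).
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 5))
    y = X[:, 0] - X[:, 1] ** 2 + rng.normal(scale=0.5, size=500)

    def perturbed_learner(eta, seed):
        r = np.random.default_rng(seed)
        w = r.exponential(size=len(X)) ** eta   # random w_m(z_i), raised to eta
        w /= w.sum()                            # normalize so that E[w_im] = 1/N
        # p_m = argmin_p E_{q_m(z)} L(y, f(x; p)); here a small regression tree
        return DecisionTreeRegressor(max_depth=3).fit(X, y, sample_weight=w)

    ensemble = [perturbed_learner(eta=1.0, seed=m) for m in range(20)]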
Perturb the algorithm:
  $p_m = \text{rand}[\arg\min_p]\, E_{q(z)}\, L(y, f(x; p))$
Control the width $\sigma$ of $r(p)$ by the degree of randomization:
  repeated partial optimizations; perturb partial solutions.
Examples: Dietterich – random trees; Breiman – random forests
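A minimal sketch of algorithm perturbation (an illustration; the split-subset mechanism is the one scikit-learn exposes as max_features, and the data are hypothetical): each split search is restricted to a random subset of $n_s$ predictors, so smaller subsets give a wider $r(p)$.

    # Perturb the algorithm: random predictor subset at each split.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(2)
    X = rng.normal(size=(500, 10))
    y = X[:, 0] * X[:, 1] + rng.normal(scale=0.5, size=500)

    n_s = int(np.log2(X.shape[1]) + 1)   # Breiman's suggested subset size
    randomized_trees = [
        DecisionTreeRegressor(max_features=n_s, random_state=m).fit(X, y)
        for m in range(20)
    ]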
GOAL

Produce a good $\{p_m\}_1^M$ so that
  $\sum_{m=1}^M c_m^*\, f(x; p_m) \simeq F^*(x)$
where $\{c_m^*\}_1^M$ = population linear regression ($L$) of $y$ on $\{f(x; p_m)\}_1^M$.
Note: both depend on knowing the population $q(z)$.
FINITE DATA

$\{z_i\}_1^N \sim q(z)$,  $\hat q(z) = \frac{1}{N} \sum_{i=1}^N \delta(z - z_i)$

Apply perturbation sampling based on $\hat q(z)$:
  Loss function / algorithm: $q(z) \to \hat q(z)$; width $\sigma$ of $r(p)$ controlled as before
Sampling distribution: random reweighting
  $\hat q_m(z) = \sum_{i=1}^N w_{im}\, \delta(z - z_i)$,  $w_{im} \sim \Pr(w)$ with $E\, w_{im} = 1/N$
  Width $\sigma$ of $r(p)$ controlled by $\mathrm{std}(w_{im})$
Fastest computation: $w_{im} \in \{0, 1/K\}$, draw $K$ from $N$ without replacement
  $\sigma \leftarrow \mathrm{std}(w) = (N/K - 1)^{1/2}/N$;  computation $\propto K/N$
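A small numerical check of the "fastest computation" scheme (a sketch; the values of N and K are arbitrary): K of the N observations get weight 1/K, the rest 0, and the empirical standard deviation of the weights matches $(N/K - 1)^{1/2}/N$.

    import numpy as np

    N, K = 1000, 100
    rng = np.random.default_rng(3)
    idx = rng.choice(N, size=K, replace=False)   # K drawn without replacement
    w = np.zeros(N)
    w[idx] = 1.0 / K                             # w_im in {0, 1/K}, E[w_im] = 1/N

    print("mean(w):", w.mean(), "  1/N:", 1.0 / N)
    print("std(w): ", w.std(), "  (N/K - 1)^(1/2)/N:", np.sqrt(N / K - 1) / N)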
Quadrature Coefficients

Population: linear regression of $y$ on $\{f(x; p_m)\}_1^M$:
  $\{c_m^*\}_1^M = \arg\min_{\{c_m\}} E_{q(z)}\, L\big(y, \sum_{m=1}^M c_m f(x; p_m)\big)$
Finite data: regularized linear regression
  $\{\hat c_m\}_1^M = \arg\min_{\{c_m\}} E_{\hat q(z)}\, L\big(y, \sum_{m=1}^M c_m f(x; p_m)\big) + \lambda \sum_{m=1}^M |c_m - c_m^{(0)}|$  (lasso)
Regularization $\Rightarrow$ reduce variance
$\{c_m^{(0)}\}_1^M$ = prior guess (usually $= 0$)
$\lambda > 0$ chosen by cross-validation
Fast algorithm: solutions for all $\lambda$ (see HTF 2001, EHT 2002)
Importance Sampled Learning Ensembles (ISLE)

Numerical integration:
  $F(x) = \int_{\mathcal{P}} a(p)\, f(x; p)\, dp \simeq \sum_{m=1}^M \hat c_m\, f(x; \hat p_m)$
  $\{\hat p_m\}_1^M \sim \hat r(p)$ $\leftarrow$ importance sampling $\leftarrow$ perturbation sampling on $\hat q(z)$
  $\{\hat c_m\}_1^M$ $\leftarrow$ regularized linear regression of $y$ on $\{f(x; \hat p_m)\}_1^M$
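Putting the two steps together, here is a minimal end-to-end ISLE-style sketch in Python/scikit-learn (an illustration under assumed data and settings, not the authors' implementation): base learners from perturbation sampling on $\hat q(z)$ via small trees fit to random subsamples, then lasso-regularized coefficients with $\lambda$ chosen by cross-validation.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(4)
    N, n = 2000, 10
    X = rng.normal(size=(N, n))
    y = X[:, 0] * X[:, 1] + np.sin(2 * X[:, 2]) + rng.normal(scale=0.5, size=N)

    # Step 1: {p_hat_m} by perturbation sampling (subsamples of size K << N).
    M, K = 200, N // 10
    learners = []
    for m in range(M):
        idx = rng.choice(N, size=K, replace=False)
        learners.append(
            DecisionTreeRegressor(max_depth=4, random_state=m).fit(X[idx], y[idx]))

    # Step 2: {c_hat_m} by lasso regression of y on the base-learner predictions.
    F = np.column_stack([f.predict(X) for f in learners])
    post = LassoCV(cv=5).fit(F, y)
    print("nonzero coefficients:", int(np.sum(post.coef_ != 0)), "of", M)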
BAGGING (Breiman 1996)

Perturb the data distribution $\hat q(z)$:
  $\hat q_m(z)$ = bootstrap sample $= \sum_{i=1}^N w_{im}\, \delta(z - z_i)$
  $w_{im} \in \{0, 1/N, 2/N, \ldots, 1\}$ $\sim$ multinomial$(1/N)$
  $p_m = \arg\min_p E_{\hat q_m(z)}\, L(y, f(x; p))$
  $\hat F(x) = \frac{1}{M} \sum_{m=1}^M f(x; p_m)$  (average)
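A minimal sketch of bagging as described here (the synthetic data are assumed): one tree per bootstrap sample, combined with equal weights $1/M$.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(5)
    X = rng.normal(size=(1000, 5))
    y = X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.5, size=1000)

    M = 100
    trees = []
    for m in range(M):
        idx = rng.choice(len(X), size=len(X), replace=True)   # bootstrap sample
        trees.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

    def bagged_predict(X_new):
        # F_hat(x) = (1/M) * sum_m f(x; p_m)
        return np.mean([t.predict(X_new) for t in trees], axis=0)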
Width $\sigma$ of $r(p)$:  $E(\mathrm{std}(w_{im})) = (1 - 1/N)^{1/2}/N \simeq 1/N$ — fixed $\Rightarrow$ no control
No joint fitting of coefficients: $\lambda = \infty$ and $c_m^{(0)} = 1/M$
Potential improvements:
  different $\sigma$ (sampling strategy)
  $\lambda < \infty$ $\Rightarrow$ jointly fit coefficients to data
RANDOM FORESTS (Breiman 1998)

$f(x; p) = T(x)$ = largest possible decision tree
Hybrid sampling strategy:
(1) $\hat q_m(z)$ = bootstrap sample (bagging)
(2) random algorithm modification: select the variable for each split from among a randomly chosen subset
  Breiman: $n_s = \lfloor \log_2 n + 1 \rfloor$
$\hat F(x) = \frac{1}{M} \sum_{m=1}^M T_m(x)$  (average)

As an ISLE: $\sigma(\mathrm{RF}) > \sigma(\mathrm{Bag})$  (increases as $n_s$ decreases)
Potential improvements: same as bagging —
  different $\sigma$ (sampling strategy)
  $\lambda < \infty$ $\Rightarrow$ jointly fit coefficients to data (more later)
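A minimal sketch of the random-forest ISLE view using scikit-learn (assumed data; max_features="log2" only approximates Breiman's $n_s$): the forest combines bootstrap sampling with a random split subset, and its trees can then be re-weighted by a lasso fit rather than averaged.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(6)
    X = rng.normal(size=(1000, 10))
    y = X[:, 0] * X[:, 1] + rng.normal(scale=0.5, size=1000)

    rf = RandomForestRegressor(n_estimators=100, max_features="log2",
                               random_state=0).fit(X, y)

    # Optional post-processing: regularized coefficients instead of weights 1/M.
    F = np.column_stack([t.predict(X) for t in rf.estimators_])
    post = LassoCV(cv=5).fit(F, y)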
SEQUENTIAL SAMPLING

Random Monte Carlo: $\{p_m \sim r(p)\}_1^M$ iid
Quasi Monte Carlo: $\{p_m\}_1^M$ = deterministic
  $J(\{p_m\}_1^M) = \min_{\{\alpha_m\}} E_{q(z)}\, L\big(y, \sum_{m=1}^M \alpha_m f(x; p_m)\big)$
  = joint regression of $y$ on $\{f(x; p_m)\}_1^M$ (pop.)
Approximation: sequential sampling (forward stagewise)
  $J_m(p \mid \{p_l\}_1^{m-1}) = \min_\alpha E_{q(z)}\, L(y, \alpha f(x; p) + h_m(x))$
  $h_m(x) = \sum_{l=1}^{m-1} \alpha_l^*\, f(x; p_l)$,  $\alpha_l^*$ = solution for $p_l$
  $p_m = \arg\min_p J_m(p \mid \{p_l\}_1^{m-1})$
Repeatedly modify the loss function: similar to $L_m(y, f) = L(y, f + \eta\, h_m(x))$, but here $\eta = 1$ and $h_m(x)$ is deterministic.
Connection to Boosting

AdaBoost (Freund & Schapire 1996): $L(y, f) = \exp(-y f)$, $y \in \{-1, 1\}$
  $\hat F(x) = \mathrm{sign}\big(\sum_{m=1}^M \alpha_m f(x; p_m)\big)$
  $\{\alpha_m\}_1^M$ = sequential partial regression coefficients

Gradient Boosting (MART, Friedman 2001): general $y$ and $L(y, f)$, $\alpha_m$ = shrunk ($\eta \ll 1$)
  $\hat F(x) = \sum_{m=1}^M \alpha_m f(x; p_m)$
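For the gradient-boosting side, scikit-learn's GradientBoostingRegressor is a MART-style implementation; in this minimal sketch (assumed data and settings) learning_rate plays the role of the shrinkage $\eta \ll 1$.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.default_rng(7)
    X = rng.normal(size=(1000, 10))
    y = X[:, 0] - 2 * X[:, 1] + X[:, 2] * X[:, 3] + rng.normal(scale=0.5, size=1000)

    # Sequential (forward stagewise) ensemble with shrunk coefficients.
    mart = GradientBoostingRegressor(n_estimators=500, learning_rate=0.01,
                                     max_depth=3, random_state=0).fit(X, y)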
Potential improvements (ISLE):
(1) $\hat F(x) = \sum_{m=1}^M \hat c_m\, f(x; \hat p_m)$
  $\{\hat p_m\}_1^M$ $\leftarrow$ sequential sampling on $\hat q(z)$
  $\{\hat c_m\}_1^M$ $\leftarrow$ regularized linear regression
(2) and/or hybrid with random $\hat q_m(z)$ (speed): sample $K$ from $N$ without replacement
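A minimal sketch of improvements (1) and (2) combined (an illustration with assumed settings): the sequential ensemble is grown on random subsamples (the subsample option draws a fraction of rows per stage without replacement), and the per-tree coefficients are then refit with a lasso instead of keeping the fixed shrunk weights.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(8)
    X = rng.normal(size=(2000, 10))
    y = X[:, 0] * X[:, 1] + np.sin(2 * X[:, 2]) + rng.normal(scale=0.5, size=2000)

    gbm = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                    max_depth=3, subsample=0.2,
                                    random_state=0).fit(X, y)

    # One column per stage-m tree (estimators_ has shape (n_estimators, 1)),
    # then regularized coefficients by lasso with lambda chosen by CV.
    F = np.column_stack([stage[0].predict(X) for stage in gbm.estimators_])
    post = LassoCV(cv=5).fit(F, y)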
ISLE Paradigm

Wide variety of ISLE methods:
(1) base learner $f(x; p)$
(2) loss criterion $L(y, f)$
(3) perturbation method
(4) degree of perturbation: $\sigma$ of $r(p)$
(5) iid vs. sequential
(6) hybrids
Examine several options.
Monte Carlo Study

100 data sets: each $N = 10000$, $n = 40$
  $\{y_{il} = F_l(x_i) + \varepsilon_{il}\}_{i=1}^{10000}$,  $l = 1, \ldots, 100$
  $\{F_l(x)\}_1^{100}$ = different (random) target functions
  $x_i \sim N(0, I_{40})$,  $\varepsilon_{il} \sim N(0, \mathrm{Var}_x(F_l(x)))$  (signal/noise = 1/1)
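A minimal sketch of this simulation setup (the talk's random target-function generator is not reproduced; the stand-in $F_l$ below is hypothetical): $x \sim N(0, I_{40})$, with the noise variance set to $\mathrm{Var}_x(F_l(x))$ so that signal/noise = 1/1.

    import numpy as np

    N, n = 10000, 40

    def make_dataset(seed):
        r = np.random.default_rng(seed)
        X = r.normal(size=(N, n))                                # x ~ N(0, I_40)
        F = X[:, 0] * X[:, 1] + np.sin(X[:, 2]) - X[:, 3] ** 2   # stand-in F_l(x)
        eps = r.normal(scale=np.sqrt(F.var()), size=N)           # Var(eps) = Var_x(F_l)
        return X, F + eps, F

    X, y, F_true = make_dataset(seed=0)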
Evaluation Criteria

Relative RMS error: $\mathrm{rmse}(\hat F_{jl}) = [1 - R^2(F_l, \hat F_{jl})]^{1/2}$
Comparative RMS error: $\mathrm{cmse} = \mathrm{rmse}(\hat F_{jl}) / \min_k \{\mathrm{rmse}(\hat F_{kl})\}$  (adjusts for problem difficulty)
$j, k \in$ {respective methods};  evaluated on 10000 independent observations
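A minimal sketch of the two criteria (reading $R^2$ as the usual coefficient of determination, which is an assumption about the slide's notation):

    import numpy as np
    from sklearn.metrics import r2_score

    def relative_rmse(F_true, F_hat):
        # rmse = [1 - R^2(F_l, F_hat)]^{1/2}
        return np.sqrt(max(0.0, 1.0 - r2_score(F_true, F_hat)))

    def comparative_rmse(rmse_by_method):
        # cmse_j = rmse_j / min_k rmse_k, per dataset
        best = min(rmse_by_method.values())
        return {name: value / best for name, value in rmse_by_method.items()}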
Properties of $F_l(x)$:
(1) 30 noise variables
(2) wide variety of functions (difficulty)
(3) emphasize lower order interactions
(4) not in the span of the base learners
Base learners: decision trees.
[Figure: Bagging relative RMS error — results for Bag, Bag_P, Bag_.05, Bag_.05_P, Bag_6, Bag_6_P, Bag_6_.05_P]
[Figure: Bagging comparative RMS error — same seven bagging variants]
[Figure: Random forests comparative RMS error — RF, RF_P, RF_.05, RF_.05_P, RF_6, RF_6_P, RF_6_.05_P]
[Figure: Bag/RF comparative RMS error — Bag, RF, Bag_6_.05_P, RF_6_.05_P]
[Figure: Sequential sampling comparative RMS error — Mart, Mart_P, Mart_10_P, Mart_.01_10_P, Mart_.01_20_P]
[Figure: Seq/Bag/RF comparative RMS error — Mart, Mart_P, Seq_.01_.2, Bag, Bag_6_.05_P, RF, RF_6_.05_P]
SUMMARY

Theory — unify (1) bagging, (2) random forests, (3) Bayesian model averaging, (4) boosting under a single paradigm: Monte Carlo integration.
  (1)–(3): iid Monte Carlo, $p \sim r(p)$;  (1), (2): perturbation sampling;  (3): MCMC
  (4): quasi Monte Carlo — approximate sequential sampling
Practice — $\{\hat c_m\}_1^M$ $\leftarrow$ lasso linear regression:
(1) improves the accuracy of RF and bagging (faster)
(2) combined with aggressive subsampling and weaker base learners, improves speed: bagging & RF by a factor > $10^2$, MART by $\sim 5$, allowing much bigger data sets. Also, prediction is many times faster.
FUTURE DIRECTIONS

(1) More thorough understanding (theory) $\Rightarrow$ specific recommendations
(2) Multiple learning ensembles (MISLES)
  $F(x) = \sum_{k=1}^K \int_{\mathcal{P}_k} a_k(p_k)\, f_k(x; p_k)\, dp_k$
  $\{f_k(x; p_k)\}_1^K$ = different (comp.) base learners
  $\hat F(x) = \sum_{k,m} c_{km}\, f_k(x; \hat p_{km})$,  $\{c_{km}\}$ $\leftarrow$ combined lasso regression
Example: $f_1$ = decision trees, $f_2 = \{x_j\}_1^n$ (no sampling)
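A minimal MISLE sketch for this example (assumed data and settings): family $f_1$ = small trees from perturbation sampling, family $f_2$ = the raw predictors $x_j$ with no sampling, and one combined lasso over all columns.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(10)
    N, n = 2000, 10
    X = rng.normal(size=(N, n))
    y = 2 * X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(scale=0.5, size=N)

    trees = []
    for m in range(100):
        idx = rng.choice(N, size=N // 10, replace=False)
        trees.append(
            DecisionTreeRegressor(max_depth=4, random_state=m).fit(X[idx], y[idx]))

    F_trees = np.column_stack([t.predict(X) for t in trees])   # f_1 columns
    F_all = np.hstack([F_trees, X])                            # add f_2 = {x_j}
    combined = LassoCV(cv=5).fit(F_all, y)                     # one combined lasso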
SLIDES

http://www-stat.stanford.edu/~jhf/isletalk.pdf

REFERENCES

HTF (2001): Hastie, Tibshirani & Friedman, The Elements of Statistical Learning, Springer.
EHT (2002): Efron, Hastie & Tibshirani, "Least Angle Regression", Stanford preprint.