Variable importance in RF, conditional variable importance in RF, and other variable importance measures


Measuring variable importance in random forests
A Comparison of Different Importance Measures
Carolin Strobl (LMU München)
Vienna, January 2009

[Figure: example classification trees from a random forest; the node annotations (n, y) are not recoverable]

Measuring variable importance in random forests
- Gini importance: mean Gini gain produced by $X_j$ over all trees (can be severely biased due to estimation bias and multiple testing; Strobl et al., 2007)

- permutation importance: mean decrease in classification accuracy after permuting $X_j$ over all trees (unbiased when subsampling is used; Strobl et al., 2007)

The permutation importance within each tree t

\[
VI^{(t)}(x_j) = \frac{\sum_{i \in \bar{B}^{(t)}} I\left(y_i = \hat{y}_i^{(t)}\right)}{\left|\bar{B}^{(t)}\right|} - \frac{\sum_{i \in \bar{B}^{(t)}} I\left(y_i = \hat{y}_{i,\pi_j}^{(t)}\right)}{\left|\bar{B}^{(t)}\right|}
\]

where $\bar{B}^{(t)}$ is the out-of-bag sample of tree t and
- $\hat{y}_i^{(t)} = f^{(t)}(x_i)$ = predicted class before permuting
- $\hat{y}_{i,\pi_j}^{(t)} = f^{(t)}(x_{i,\pi_j})$ = predicted class after permuting $X_j$
- $x_{i,\pi_j} = (x_{i,1}, \ldots, x_{i,j-1}, x_{\pi_j(i),j}, x_{i,j+1}, \ldots, x_{i,p})$

Note: $VI^{(t)}(x_j) = 0$ by definition, if $X_j$ is not in tree t

The permutation importance over all trees:

\[
VI(x_j) = \frac{\sum_{t=1}^{ntree} VI^{(t)}(x_j)}{ntree}
\]

What kind of independence corresponds to this kind of permutation?

obs   Y      X_j               Z
1     y_1    x_{\pi_j(1),j}    z_1
...
i     y_i    x_{\pi_j(i),j}    z_i
...
n     y_n    x_{\pi_j(n),j}    z_n

$H_0: X_j \perp Y, Z$ or $X_j \perp Y \wedge X_j \perp Z$

\[
P(Y, X_j, Z) \stackrel{H_0}{=} P(Y, Z) \cdot P(X_j)
\]
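The per-tree definition above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `perm_importance`, `predict_fun`, `oob_x` and `oob_y` are hypothetical names (not part of party or randomForest), and a hand-written stump stands in for a fitted tree $f^{(t)}$.

```r
# Sketch of the per-tree permutation importance VI^(t)(x_j).
# All names here are hypothetical; this is not the party/randomForest API.
perm_importance <- function(predict_fun, oob_x, oob_y, j) {
  # accuracy on the out-of-bag sample before permuting
  acc_before <- mean(predict_fun(oob_x) == oob_y)
  # permute column j only; all other columns stay in place
  x_perm <- oob_x
  x_perm[[j]] <- sample(x_perm[[j]])
  # accuracy after permuting X_j
  acc_after <- mean(predict_fun(x_perm) == oob_y)
  acc_before - acc_after
}

# ASSUMPTION: a stump that splits on x1 only plays the role of one tree.
stump <- function(x) ifelse(x$x1 > 0, "a", "b")

set.seed(1)
oob <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
y   <- ifelse(oob$x1 > 0, "a", "b")

vi_x1 <- perm_importance(stump, oob, y, "x1")  # x1 is used by the stump
vi_x2 <- perm_importance(stump, oob, y, "x2")  # x2 is not in the "tree"

# Forest-level importance: average the per-tree values over ntree trees.
# The same stump is reused here as a placeholder for different trees.
ntree <- 25
vi_forest_x1 <- mean(replicate(ntree, perm_importance(stump, oob, y, "x1")))
```

Because the stump never looks at `x2`, permuting it leaves every prediction unchanged, which mirrors the note that the importance is zero for a variable that is not in the tree.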

What kind of independence corresponds to this kind of permutation?
- the original permutation scheme reflects independence of $X_j$ from both Y and the remaining predictor variables Z
- a high variable importance can result from violation of either one!

Suggestion: conditional permutation scheme

obs            Y      X_j                      Z
i: z_i = a     y_i    x_{\pi_{j|Z=a}(i),j}     z_i = a
i: z_i = b     y_i    x_{\pi_{j|Z=b}(i),j}     z_i = b
...

$H_0: X_j \perp Y \mid Z$

\[
P(Y, X_j \mid Z) \stackrel{H_0}{=} P(Y \mid Z) \cdot P(X_j \mid Z)
\quad \text{or} \quad
P(Y \mid X_j, Z) \stackrel{H_0}{=} P(Y \mid Z)
\]
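A minimal base R sketch of the conditional scheme above: the values of $X_j$ are permuted only among observations that fall into the same cell of a partition of Z. The function name `cond_permute`, the argument `cutpoints` and the simulated data are assumptions made for illustration; they are not taken from the party implementation.

```r
# Sketch of conditional permutation: permute x_j only within groups of
# observations that share a cell of a partition built from Z.
# `cond_permute` and `cutpoints` are hypothetical names.
cond_permute <- function(xj, z, cutpoints) {
  # cells of the partition: intervals of z delimited by the cutpoints
  cell <- cut(z, breaks = c(-Inf, cutpoints, Inf))
  out <- xj
  for (idx in split(seq_along(xj), cell)) {
    # cells with fewer than two observations have nothing to permute
    if (length(idx) > 1) out[idx] <- xj[sample(idx)]
  }
  out
}

# ASSUMPTION: simulated data in which x_j is strongly correlated with z.
set.seed(1)
n  <- 500
z  <- rnorm(n)
xj <- z + rnorm(n, sd = 0.3)
cutpoints <- quantile(z, probs = seq(0.1, 0.9, by = 0.1))

xj_uncond <- sample(xj)                       # original scheme
xj_cond   <- cond_permute(xj, z, cutpoints)   # conditional scheme

cor_uncond <- cor(xj_uncond, z)  # association with Z is destroyed
cor_cond   <- cor(xj_cond, z)    # association with Z is largely kept
```

The unconditional permutation breaks the link between $X_j$ and Z as well as the link to Y, while the within-cell permutation keeps most of the dependence on Z intact, which is the point of conditioning.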

Technically
- use any partition of the feature space for conditioning
- here: use binary partition already learned by tree

For each tree:
- determine variables to condition on (via threshold)
- extract their cutpoints
- generate partition using cutpoints as bisectors

Strobl et al. (2008)

Toy example
spurious correlation between shoe size and reading skills in school-children

    > mycf <- cforest(score ~ ., data = readingSkills,
    +                 control = cforest_unbiased(mtry = ))
    > varimp(mycf)
    nativeSpeaker age shoeSize
    > varimp(mycf, conditional = TRUE)
    nativeSpeaker age shoeSize

from party .9-99

[Figure: Simulation results; one panel per value of mtry, comparing unconditional and conditional importance]

[Figure: Peptide-binding data; unconditional and conditional importance, with the extracted variable labels hy8, flex8 and pol]

Other variable importance measures
- partial correlation, standardized beta: conditional effect of $X_j$ given all other variables in the model
- random forest permutation importance: averaging over trees
  - unconditional varimp (randomForest, party; Breiman et al.; Hothorn et al., 2008)
  - conditional varimp (party; Hothorn et al.)
- averaging over orderings ($R^2$ decomposition)
  - for linear models (relaimpo; Grömping): LMG (Lindeman, Merenda, and Gold, 1980), dominance analysis (Azen and Budescu, 2003), PMVD (Feldman, 2005)
  - for GLMs (hier.part; Walsh and Nally, 2008): hierarchical partitioning (Chevan and Sutherland, 1991)
- elastic net (elasticnet, caret; Zou and Hastie, 2008; Kuhn, 2008)
  - grouping property: correlated predictors get similar (largest) score

Desirable (?) properties
- proper decomposition: scores sum up to model $R^2$
- non-negativity
- exclusion: $\beta_j = 0 \Rightarrow$ score $= 0$
- inclusion: $\beta_j \neq 0 \Rightarrow$ score $\neq 0$

Grömping (2007)
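To make "averaging over orderings" concrete, here is a small base R sketch of an LMG-style decomposition: each predictor's share is its sequential gain in $R^2$, averaged over all orderings of the predictors. The helper names (`perms`, `r2`, `lmg_sketch`) are hypothetical and the built-in `mtcars` data are used purely as an example; for real analyses the relaimpo package named above (its `calc.relimp` function) is the appropriate tool.

```r
# All orderings of a vector (only feasible for small p: p! orderings).
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (rest in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], rest)
  }
  out
}

# R^2 of the linear model that uses only the predictors in `vars`.
r2 <- function(vars, data, response) {
  if (length(vars) == 0) return(0)
  summary(lm(reformulate(vars, response), data = data))$r.squared
}

# LMG-style shares: sequential R^2 gains averaged over all orderings.
lmg_sketch <- function(data, response) {
  preds  <- setdiff(names(data), response)
  share  <- setNames(numeric(length(preds)), preds)
  orders <- perms(preds)
  for (ord in orders) {
    for (k in seq_along(ord)) {
      gain <- r2(ord[seq_len(k)], data, response) -
              r2(ord[seq_len(k - 1)], data, response)
      share[ord[k]] <- share[ord[k]] + gain
    }
  }
  share / length(orders)
}

# ASSUMPTION: mtcars is an arbitrary example data set, not from the talk.
dat     <- mtcars[, c("mpg", "wt", "hp", "disp")]
shares  <- lmg_sketch(dat, "mpg")
full_r2 <- summary(lm(mpg ~ ., data = dat))$r.squared
```

Within each ordering the gains telescope to the full-model $R^2$, so the averaged shares sum to it as well; this is the "proper decomposition" property in the list above.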

Desirable (?) properties and the measures that satisfy them
- proper decomposition (scores sum up to model $R^2$): LMG, PMVD
- non-negativity: LMG, PMVD, RF varimp (in principle)
- exclusion ($\beta_j = 0 \Rightarrow$ score $= 0$): partial correlation, standardized betas, PMVD, RF conditional varimp (in principle), elasticnet?
- inclusion ($\beta_j \neq 0 \Rightarrow$ score $\neq 0$): all

Grömping (2007)

Simulation study
dgp: $y_i = \beta_1 x_{i,1} + \cdots + \beta_{12} x_{i,12} + \varepsilon_i$, with $\varepsilon_i$ i.i.d. normal and $X_1, \ldots, X_{12}$ jointly normal with covariance matrix $\Sigma$

[Table: covariance matrix $\Sigma$ and coefficients $\beta_j$ for $X_1, \ldots, X_{12}$; the values are not recoverable]
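A data-generating process of this form can be simulated in base R as sketched below. The concrete numbers are assumptions chosen only for illustration, since the values on the slide are lost: the sample size, the error variance, a block of four predictors with pairwise correlation 0.9, and the coefficient vector `beta` are all hypothetical.

```r
set.seed(42)
n <- 100   # ASSUMPTION: sample size chosen for illustration
p <- 12    # twelve predictors, as in the dgp above

# ASSUMPTION: X1..X4 form a correlated block, the rest are independent.
Sigma <- diag(p)
Sigma[1:4, 1:4] <- 0.9
diag(Sigma) <- 1

# ASSUMPTION: hypothetical coefficients with some exact zeros.
beta <- c(5, 5, 2, 0, -5, -5, -2, 0, 0, 0, 0, 0)

# Draw X ~ N(0, Sigma): chol() returns R with t(R) %*% R == Sigma,
# so rows of Z %*% R have covariance Sigma when Z is standard normal.
X <- matrix(rnorm(n * p), n, p) %*% chol(Sigma)
colnames(X) <- paste0("X", 1:p)

# ASSUMPTION: error variance 0.5
y   <- drop(X %*% beta) + rnorm(n, sd = sqrt(0.5))
dat <- data.frame(y = y, X)

# Standardized coefficients of the linear model as a first importance score.
fit      <- lm(y ~ ., data = dat)
std_beta <- coef(fit)[-1] * apply(X, 2, sd) / sd(y)
```

With a setup like this, a measure that obeys the exclusion principle should give scores near zero to the predictors whose coefficient is zero, even when they are correlated with influential predictors (here `X4`).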

[Figures: simulation results for each measure. Panel titles: "Linear model, (standardized) coefficient"; "LMG"; "PMVD"; "RF unconditional" (several values of mtry); "RF conditional" (several values of mtry); "Elastic net, (standardized) coefficient"]

Now wait a second...
what about elastic net's grouping property?

[Figures: elastic net coefficient paths for several values of lambda; x-axis: fraction, y-axis: standardized coefficients]

[Figure: elastic net coefficient path for a further value of lambda; x-axis: fraction, y-axis: standardized coefficients]

- w.r.t. prediction accuracy: variable importance measures following the exclusion principle rule
  - standardized betas, PMVD (not quite), RF conditional (especially with large mtry) and elastic net (tuned!)
- RF: not limited to linear model, interactions included, applicable even if p > n
- if you want elastic net to group: don't tune!?

References

Azen, R. and D. V. Budescu (2003). The dominance analysis approach for comparing predictors in multiple regression. Psychological Methods 8(2), 129-148.

Breiman, L., A. Cutler, A. Liaw, and M. Wiener. Breiman and Cutler's Random Forests for Classification and Regression. R package.

Chevan, A. and M. Sutherland (1991). Hierarchical partitioning. The American Statistician 45(2), 90-96.

Feldman, B. (2005). Relative importance and value. Technical report.

Grömping, U. relaimpo: Relative Importance of Regressors in Linear Models. R package.

Grömping, U. (2007). Estimators of relative importance for linear regression based on variance decomposition. The American Statistician 61(2), 139-147.

Kuhn, M. (2008). caret: Classification and Regression Training. R package.

Lindeman, R., P. Merenda, and R. Gold (1980). Introduction to Bivariate and Multivariate Analysis. Glenview: Scott Foresman & Co.

Strobl, C., A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis (2008). Conditional variable importance for random forests. BMC Bioinformatics 9:307.

Strobl, C., A.-L. Boulesteix, A. Zeileis, and T. Hothorn (2007). Bias in random forest variable importance measures: Illustrations, sources and a solution. BMC Bioinformatics 8:25.

Walsh, C. and R. M. Nally (2008). hier.part: Hierarchical Partitioning. R package.

Zou, H. and T. Hastie (2008). elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA. R package.
