Conditional variable importance in R package extendedForest

Stephen J. Smith, Nick Ellis, C. Roland Pitcher

February 10, 2011

Contents

1 Introduction
2 Methods
   2.1 Conditional permutation
   2.2 Simulation Study
3 Results
4 Session information
References

1 Introduction

The gradientForest package was developed to analyse large numbers of potential predictor variables by integrating the individual results of random forest analyses over a number of species. The random forests for each species were produced by the R package extendedForest, which consists of modifications that we made to the original randomForest package [Liaw and Wiener, 2002].

One of the major modifications made to randomForest concerned the method for calculating variable importance when two or more predictor variables are correlated. Many of the predictor variables used in ecological studies are correlated either naturally (e.g., temperature decreases with water depth) or functionally (e.g., benthic irradiance is calculated as a function of bottom depth and light attenuation). While some of these predictors may determine species distribution or abundance, other collinear predictors may not. The random subset approach to fitting predictor variables at each node can result in a correlated but less influential predictor standing in for a more highly influential predictor in the early splits of an individual tree, depending upon which predictors are selected in the subset. This tendency can be lessened by increasing the subsample size of predictors for each node, but the trade-off is an increase in correlation between trees in the forest, with a concurrent increase in generalization error and decrease in accuracy [Breiman, 2001; see also Grömping, 2009].

Strobl et al. [2008] have also demonstrated that the permutation method for estimating variable importance exhibits a bias towards correlated predictor variables. The underlying reason for this behaviour has to do with the structure of the null hypothesis implied by the importance measure, i.e., independence between the response Y and the predictor X_j being permuted. A small value of the importance measure would suggest that Y and X_j are independent, but it also assumes that X_j is independent of the other predictor variables Z = (X_1, ..., X_{j-1}, X_{j+1}, ..., X_p) in the model that were not permuted. Correlation between X_j and Z will result in an apparent increase in importance that reflects the lack of independence between X_j and Z, instead of only reflecting the lack of independence between X_j and Y.

To remedy this situation, Strobl et al. [2008] proposed a conditional permutation approach in which the values of X_j in the OOB sample for each tree are permuted within partitions of the values of the predictors in that tree that are correlated with X_j. Permutation importance is then calculated by passing the OOB samples, reconfigured on this permutation grid, through each respective tree in the standard way. In this document we present the results of a simulation study demonstrating the impact of correlation between predictor variables on estimates of variable importance, and how the conditional permutation method implemented in extendedForest reduces this impact.
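The bias just described is easy to reproduce. The following minimal sketch (ours, not taken from the package documentation; it assumes extendedForest is installed) gives a predictor no effect on the response but a 0.9 correlation with an influential predictor; under marginal permutation (maxLevel = 0, the setting labelled "Marginal" in the simulation below) it nevertheless tends to receive clearly positive importance.

> ## Illustrative sketch, not from the package: x2 has no effect on y
> ## but is strongly correlated with the influential predictor x1.
> require(extendedForest)
> set.seed(1)
> n <- 200
> x1 <- rnorm(n)
> x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # cor(x1, x2) ~ 0.9; beta_2 = 0
> y <- 5 * x1 + rnorm(n, 0, 0.5)
> rf <- randomForest(y ~ x1 + x2, data.frame(y, x1, x2),
+     maxLevel = 0, importance = TRUE, ntree = 500)
> rf$importance  # x2 typically shows positive marginal importance despite beta_2 = 0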

2 Methods

2.1 Conditional permutation

Our implementation of the method of Strobl et al. [2008] in extendedForest is based on the following. For predictor X_j, determine all predictors X_i (i = 1, ..., k; i ≠ j and k ≤ p - 1) that are correlated with X_j above some threshold ρ. For tree t, find the first K split values, s_1, ..., s_K, and the indices v_1, ..., v_K of the X_i on which those splits occur. For each observation l in the OOB sample, designate a grouping or partitioning label as

    g_l = \sum_{i=1}^{K} 2^{i-1} I(X_{v_i} < s_i)    (1)

where I(·) is the indicator function, taking the value 1 if its argument is true and 0 otherwise. Permutation of X_j in the OOB sample is applied within the above groups, and calculation of the permutation importance measure proceeds as before. The grouping labels will take at most 2^K different values, although some combinations may be missing. K should be chosen not so large that there are too few observations per partition. We used a rule of thumb, K = log2(0.368 N_sites / 2), which, if the sites were uniformly distributed among partitions, would ensure at least two points per partition; the factor 0.368 ≈ e^{-1} is the expected proportion of sites in an OOB sample. For example, with N_sites = 100 the OOB sample holds about 37 observations and K = log2(18.4) ≈ 4.
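To make equation (1) concrete, the following stand-alone sketch (ours; the function and variable names are illustrative, not the extendedForest internals) computes the grouping labels for a toy OOB sample with K = 2 correlated split variables and then permutes X_j within each partition.

> ## Stand-alone sketch of the permutation grid in equation (1);
> ## names are illustrative, not the extendedForest internal code.
> permWithin <- function(z) if (length(z) < 2) z else sample(z)
> condPermute <- function(xj, splitVars, splitVals) {
+     ## xj:        OOB values of the predictor X_j being permuted
+     ## splitVars: list of OOB values of the correlated split variables X_{v_i}
+     ## splitVals: the corresponding split values s_1, ..., s_K
+     K <- length(splitVals)
+     g <- rep(0, length(xj))
+     for (i in seq_len(K))    # g_l = sum_i 2^(i-1) I(X_{v_i} < s_i)
+         g <- g + 2^(i - 1) * (splitVars[[i]] < splitVals[i])
+     ave(xj, g, FUN = permWithin)  # permute X_j within each partition
+ }
> set.seed(2)
> oob <- data.frame(x1 = rnorm(20), x2 = rnorm(20), x3 = rnorm(20))
> ## Suppose x2 and x3 are correlated with x1, with first splits at 0.1 and -0.4:
> x1Perm <- condPermute(oob$x1, list(oob$x2, oob$x3), c(0.1, -0.4))

With K = 2 the labels g_l take values in {0, 1, 2, 3}, so x1 is shuffled only among OOB observations falling on the same side of both splits; its association with the correlated predictors is preserved while its association with the response is broken.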

2.2 Simulation Study

We used the simulation study design of Strobl et al. [2008] to demonstrate the difference between determining variable importance by conditional and by marginal permutation. The response variable was set as a function of twelve predictor variables, i.e., y_i = β_1 x_{i,1} + ... + β_{12} x_{i,12} + ε_i, where ε_i ~ N(0, 0.5). The coefficients for the predictor variables were set so that only six of the twelve were influential (Table 1).

Table 1: Regression coefficients for the linear model used in the simulation.

  Predictor   x1   x2   x3   x4   x5   x6   x7   x8   x9   x10   x11   x12
  β_j          5    5    2    0   -5   -5   -2    0    0    0     0     0

The correlation structure was introduced by drawing the predictor variables from a multivariate normal distribution with zero mean vector and covariance Σ. All predictors were defined to have unit variance, σ_{j,j} = 1, and only the first four predictors were block-correlated, with σ_{i,j} = 0.9 (i ≠ j; i, j ≤ 4). The remaining off-diagonal elements of Σ were set to zero. The R code used to run the simulation follows.

> require(extendedForest)
> require(MASS)
> Cov <- matrix(0, 12, 12)
> Cov[1:4, 1:4] <- 0.9
> diag(Cov) <- 1
> beta <- c(5, 5, 2, 0, -5, -5, -2, 0, 0, 0, 0, 0)
> maxK <- c(0, 2, 4, 6)
> nsites <- 100
> nsim <- 100
> imp <- array(0, dim = c(12, 4, nsim))
> set.seed(222)
> for (sim in 1:nsim) {
+     X <- mvrnorm(nsites, rep(0, 12), Sigma = Cov)
+     Y <- X %*% beta + rnorm(nsites, 0, 0.5)
+     df <- cbind(Y = Y, as.data.frame(X))
+     for (lev in 1:4) {
+         RF <- randomForest(Y ~ ., df, maxLevel = maxK[lev],
+             importance = TRUE, ntree = 500, corr.threshold = 0.5,
+             mtry = 8)
+         imp[, lev, sim] <- RF$importance[, 1]
+     }
+ }
> dimnames(imp) <- list(rownames(RF$importance), as.character(maxK), NULL)
> imp <- as.data.frame.table(imp)
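As a quick check of the simulated design, the realized correlations of the final draw (the X matrix left over from the last iteration of the loop) can be inspected; the first four variables should show the intended block structure:

> round(cor(X)[1:5, 1:5], 2)  # ~0.9 within variables 1-4, ~0 elsewhere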

3 Results

Marginal permutation identifies variables 1 and 2 as the most important, followed by variables 3 to 7. Even though variable 4 had no influence on the response, its correlation with variables 1 to 3 resulted in it being ranked as more important than variables 5 to 7.

> require(lattice)
> names(imp) <- c("var", "maxK", "sim", "importance")
> print(bwplot(var ~ importance | ifelse(maxK == "0", "Marginal",
+     paste("Conditional: Level", maxK, sep = "=")), imp, as.table = TRUE))

[Figure: box-and-whisker plots of importance (horizontal axis, 0 to 100) for variables V1 to V12, in four panels: Marginal, and Conditional at Level=2, Level=4 and Level=6.]

The figure shows the variable importances obtained for differing numbers of partitions in the permutation grid. Based on our rule of thumb, the number of levels should be set to K = 4, and there appears to be little difference between the results for K = 4 and K = 6. In both of these cases, the importances of variables 1 and 2 were very similar to those of variables 5 and 6, as expected from the linear model. Further, variable 4 is now only slightly ahead of variable 7 in importance. Evidently, this approach eliminates most, but not all, of the effects of correlation. It is possible that increasing the number of partitions may reduce the importance of variable 4 even further (compare the results for K = 4 and K = 6); however, at some point there will not be enough observations in each partition to calculate importance.
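The boxplots can be complemented with a numeric summary. For example, the median importance of each variable at each level across the 100 simulations can be tabulated from the reshaped imp data frame above (a sketch; not part of the original analysis):

> with(imp, tapply(importance, list(var, maxK), median))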

4 Session information

The simulation and output in this document were generated in the following computing environment.

- R version 2.10.1 (2009-12-14), i386-pc-mingw32
- Locale: LC_COLLATE=English_Australia.1252, LC_CTYPE=English_Australia.1252, LC_MONETARY=English_Australia.1252, LC_NUMERIC=C, LC_TIME=English_Australia.1252
- Base packages: base, datasets, graphics, grDevices, methods, stats, tools, utils
- Other packages: extendedForest 1.5, lattice 0.18-3, MASS 7.3-4
- Loaded via a namespace (and not attached): grid 2.10.1

References

L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

U. Grömping. Variable importance assessment in regression: linear regression versus random forest. The American Statistician, 63:308-319, 2009. doi: 10.1198/tast.2009.08199.

A. Liaw and M. Wiener. Classification and regression by randomForest. R News, 2(3):18-22, 2002. URL http://cran.r-project.org/doc/rnews/.

C. Strobl, A.-L. Boulesteix, T. Kneib, T. Augustin, and A. Zeileis. Conditional variable importance for random forests. BMC Bioinformatics, 9(1):307, 2008.