Unbiased Recursive Partitioning I: A Non-parametric Conditional Inference Framework


1 Unbiased Recursive Partitioning I: A Non-parametric Conditional Inference Framework
Torsten Hothorn¹, Kurt Hornik² and Achim Zeileis²
¹ Friedrich-Alexander-Universität Erlangen-Nürnberg, ² Wirtschaftsuniversität Wien

2 Recursive Partitioning Morgan and Sonquist (1963): Automated Interaction Detection. Many variants have been (and still are) published, the majority of which are special cases of a simple two-stage algorithm: Step 1: partition the observations by univariate splits in a recursive way; Step 2: fit a constant model in each cell of the resulting partition. Most prominent representatives: CART (Breiman et al., 1984) and C4.5 (Quinlan, 1993), both implementing an exhaustive search.

3 Two Fundamental Problems Overfitting: Mingers (1987) notes that the algorithm [...] has no concept of statistical significance, and so cannot distinguish between a significant and an insignificant improvement in the information measure. Selection Bias: Exhaustive search methods suffer from a severe selection bias towards covariates with many possible splits and/or missing values.

4 A Statistical Approach We enter at the point where White and Liu (1994) demand [...] a statistical approach [to recursive partitioning] which takes into account the distributional properties of the measures, and present a unified framework embedding recursive binary partitioning into well-defined statistical theory: Part I: permutation tests developed by Strasser and Weber (1999); Part II: tests for parameter instability in (parametric) regression models.

5 The Regression Problem The distribution of a (possibly multivariate) response $Y \in \mathcal{Y}$ is to be regressed on an $m$-dimensional covariate vector $X = (X_1, \ldots, X_m) \in \mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_m$:
$$D(Y \mid X) = D(Y \mid X_1, \ldots, X_m) = D(Y \mid f(X_1, \ldots, X_m)),$$
based on a learning sample of $n$ observations $\mathcal{L}_n = \{(Y_i, X_{1i}, \ldots, X_{mi});\ i = 1, \ldots, n\}$, possibly with case weights $\mathbf{w} = (w_1, \ldots, w_n)$. We are interested in estimating $f$.

6 A Generic Algorithm
1. For case weights $\mathbf{w}$, test the global null hypothesis of independence between any of the $m$ covariates and the response. Stop if this hypothesis cannot be rejected; otherwise select the covariate $X_{j^*}$ with the strongest association to $Y$.
2. Choose a set $A^* \subseteq \mathcal{X}_{j^*}$ in order to split $\mathcal{X}_{j^*}$ into two disjoint sets $A^*$ and $\mathcal{X}_{j^*} \setminus A^*$. The case weights $\mathbf{w}_{\text{left}}$ and $\mathbf{w}_{\text{right}}$ determine the two subgroups, with $w_{\text{left},i} = w_i I(X_{j^*i} \in A^*)$ and $w_{\text{right},i} = w_i I(X_{j^*i} \notin A^*)$ for all $i = 1, \ldots, n$ ($I(\cdot)$ denotes the indicator function).
3. Recursively repeat steps 1 and 2 with the modified case weights $\mathbf{w}_{\text{left}}$ and $\mathbf{w}_{\text{right}}$, respectively.
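As a minimal sketch (not the party implementation), this loop can be written in R using conditional-inference p-values from the coin package; the sketch assumes numeric covariates and, for brevity, splits the selected covariate at its median instead of searching for the best subset (the real splitting criterion appears below):

## Minimal sketch of the generic algorithm; assumptions: numeric covariates,
## a median split standing in for the best-split search of step 2.
library("coin")

grow <- function(data, response, w = rep(1, nrow(data)), alpha = 0.05) {
  sub <- data[w > 0, , drop = FALSE]
  if (nrow(sub) < 20) return(NULL)                 # crude sample-size guard
  covariates <- setdiff(names(data), response)
  ## Step 1: per-covariate conditional independence tests, Bonferroni-adjusted
  p <- sapply(covariates, function(j)
    pvalue(independence_test(as.formula(paste(response, "~", j)), data = sub)))
  if (min(p.adjust(p, "bonferroni")) > alpha) return(NULL)  # H0 not rejected: stop
  xj <- covariates[which.min(p)]                   # strongest association wins
  ## Step 2 (simplified): binary split of the selected covariate at its median
  cut <- median(sub[[xj]])
  w_left  <- w * as.numeric(data[[xj]] <= cut)
  w_right <- w * as.numeric(data[[xj]] >  cut)
  if (sum(w_left) == 0 || sum(w_right) == 0) return(NULL)
  ## Step 3: recurse with the modified case weights
  list(split = list(var = xj, at = cut),
       left  = grow(data, response, w_left,  alpha),
       right = grow(data, response, w_right, alpha))
}
## usage (hypothetical): grow(mydata, response = "y")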

7 Recursive Partitioning by Conditional Inference In each node identified by case weights $\mathbf{w}$, the global hypothesis of independence is formulated in terms of the $m$ partial hypotheses $H_0^j: D(Y \mid X_j) = D(Y)$ with global null hypothesis $H_0 = \bigcap_{j=1}^m H_0^j$. The recursion stops when $H_0$ cannot be rejected at a pre-specified level $\alpha$. Otherwise, the association between $Y$ and each of the covariates $X_j$, $j = 1, \ldots, m$, is measured by test statistics or $P$-values indicating the deviation from the partial hypotheses $H_0^j$.

8 Linear Statistics Use a (multivariate) linear statistic
$$T_j(\mathcal{L}_n, \mathbf{w}) = \operatorname{vec}\left(\sum_{i=1}^n w_i\, g_j(X_{ji})\, h(Y_i, (Y_1, \ldots, Y_n))^\top\right) \in \mathbb{R}^{p_j q}$$
with $g_j: \mathcal{X}_j \to \mathbb{R}^{p_j}$ a transformation of $X_j$ and influence function $h: \mathcal{Y} \times \mathcal{Y}^n \to \mathbb{R}^q$. This type of statistic was suggested by Strasser and Weber (1999).
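For intuition, consider the simplest case $p_j = q = 1$ with identity transformation $g_j$ and identity influence function: $T_j$ collapses to a weighted sum of cross products. A toy computation in R (the data and names are ours, purely illustrative):

## Toy case: numeric X_j and Y, unit case weights, g_j and h the identity,
## so T_j reduces to the scalar sum_i w_i * x_i * y_i.
set.seed(290875)
n <- 100
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
w <- rep(1, n)
Tj <- sum(w * x * y)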

9 Conditional Expectation and Covariance under $H_0^j$
$$\mu_j = E(T_j(\mathcal{L}_n, \mathbf{w}) \mid S(\mathcal{L}_n, \mathbf{w})) = \operatorname{vec}\left(\left(\sum_{i=1}^n w_i\, g_j(X_{ji})\right) E(h \mid S(\mathcal{L}_n, \mathbf{w}))^\top\right),$$
$$\Sigma_j = V(T_j(\mathcal{L}_n, \mathbf{w}) \mid S(\mathcal{L}_n, \mathbf{w})) = \frac{w_\cdot}{w_\cdot - 1}\, V(h \mid S(\mathcal{L}_n, \mathbf{w})) \otimes \left(\sum_i w_i\, g_j(X_{ji})\, g_j(X_{ji})^\top\right) - \frac{1}{w_\cdot - 1}\, V(h \mid S(\mathcal{L}_n, \mathbf{w})) \otimes \left(\sum_i w_i\, g_j(X_{ji})\right) \left(\sum_i w_i\, g_j(X_{ji})\right)^\top$$
with $w_\cdot = \sum_{i=1}^n w_i$.
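In the scalar toy case the Kronecker products reduce to ordinary multiplication, so both moments can be computed directly; here $E(h \mid S)$ and $V(h \mid S)$ are simply the weighted mean and variance of the responses:

## mu_j and Sigma_j in the scalar case (p_j = q = 1), continuing the toy example:
w. <- sum(w)                      # w. = sum of case weights
Eh <- sum(w * y) / w.             # E(h | S)
Vh <- sum(w * (y - Eh)^2) / w.    # V(h | S)
mu    <- sum(w * x) * Eh
Sigma <- w. / (w. - 1) * Vh * sum(w * x^2) -
         1  / (w. - 1) * Vh * sum(w * x)^2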

10 Test Statistics A (multivariate) linear statistic $T_j$ can now be used to construct a test statistic for testing $H_0^j$, for example via
$$c_{\max}(t, \mu, \Sigma) = \max_{k=1,\ldots,pq} \frac{|(t - \mu)_k|}{\sqrt{(\Sigma)_{kk}}}$$
or
$$c_{\text{quad}}(t, \mu, \Sigma) = (t - \mu)\, \Sigma^+ (t - \mu)^\top,$$
where $\Sigma^+$ is the Moore-Penrose inverse of $\Sigma$.
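Continuing the toy example, both statistics are one-liners in the scalar case (for $pq > 1$, $\Sigma^+$ could be obtained with MASS::ginv). As a sanity check, the standardized statistic reported by coin's independence_test should match up to sign:

## c_max and c_quad in the scalar case:
c_max  <- abs(Tj - mu) / sqrt(Sigma)
c_quad <- (Tj - mu)^2 / Sigma      # Sigma^+ is just 1/Sigma here

## cross-check: Z of the conditional test equals (Tj - mu) / sqrt(Sigma)
library("coin")
statistic(independence_test(y ~ x, teststat = "scalar"))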

11 Variable Selection and Stopping Criteria Test $H_0$ based on $P_1, \ldots, P_m$, where
$$P_j = P_{H_0^j}\left(c(T_j(\mathcal{L}_n, \mathbf{w}), \mu_j, \Sigma_j) \geq c(t_j, \mu_j, \Sigma_j) \mid S(\mathcal{L}_n, \mathbf{w})\right),$$
conditional on all permutations $S(\mathcal{L}_n, \mathbf{w})$ of the data. This solves the overfitting problem. When we can reject $H_0$ in step 1 of the generic algorithm, we select the covariate $X_{j^*}$ with minimum $P$-value, $j^* = \operatorname{argmin}_{j=1,\ldots,m} P_j$, of the conditional tests for $H_0^j$. This prevents variable selection bias.
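Since $\mu_j$ and $\Sigma_j$ are invariant under permutations of the responses, the conditional $P$-value can also be approximated by brute force rather than asymptotically; a sketch continuing the toy example:

## Monte-Carlo approximation of P_j: condition on S(L_n, w) by permuting y;
## mu and Sigma stay fixed and need no recomputation.
B <- 10000
cperm <- replicate(B, {
  yp <- sample(y)
  abs(sum(w * x * yp) - mu) / sqrt(Sigma)
})
pval <- (sum(cperm >= c_max) + 1) / (B + 1)  # add-one correction avoids p = 0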

12 Splitting Criteria The goodness of a split is evaluated by two-sample linear statistics which are special cases of the linear statistic $T_j$. For all possible subsets $A$ of the sample space $\mathcal{X}_{j^*}$, the linear statistic
$$T_{j^*}^A(\mathcal{L}_n, \mathbf{w}) = \operatorname{vec}\left(\sum_{i=1}^n w_i\, I(X_{j^*i} \in A)\, h(Y_i, (Y_1, \ldots, Y_n))^\top\right)$$
induces a two-sample statistic, and we implement the best split
$$A^* = \operatorname{argmax}_A\, c(t_{j^*}^A, \mu_{j^*}^A, \Sigma_{j^*}^A).$$
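For an ordered covariate the candidate subsets are of the form $A = \{x \leq a\}$, and the search is a loop over cutpoints. A sketch under that assumption (the minbucket argument mimicking party's minimum node size is ours; nominal covariates would need subset enumeration instead):

## Exhaustive search for the best binary split of a numeric covariate,
## scoring each cutpoint by the standardized two-sample statistic.
best_split <- function(x, y, w = rep(1, length(x)), minbucket = 7) {
  w. <- sum(w)
  Eh <- sum(w * y) / w.
  Vh <- sum(w * (y - Eh)^2) / w.
  score <- function(a) {
    gi <- as.numeric(x <= a)            # g(x) = I(x in A)
    sg <- sum(w * gi)
    if (sg < minbucket || w. - sg < minbucket) return(NA_real_)
    TA  <- sum(w * gi * y)              # two-sample linear statistic
    muA <- sg * Eh
    SA  <- w. / (w. - 1) * Vh * sg - 1 / (w. - 1) * Vh * sg^2
    abs(TA - muA) / sqrt(SA)
  }
  cand <- sort(unique(x))
  cand <- cand[-length(cand)]           # cutpoints strictly inside the range
  stats <- vapply(cand, score, numeric(1))
  cand[which.max(stats)]                # A* = argmax of the standardized statistic
}
## e.g. on the toy data: best_split(x, y)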

13 Examples: Tree Pipit Abundance The occurrence of tree pipits was recorded several times at n = 86 stands, and the impact of nine environmental factors on the abundance is of special interest here.
R> library("party")
R> data("treepipit", package = "coin")
R> tptree <- ctree(counts ~ ., data = treepipit)
R> plot(tptree, terminal_panel = node_hist(tptree,
+      breaks = 0:6 - 0.5, ymax = 65, horizontal = FALSE,
+      freq = TRUE))

14 [Tree plot: single split in coverstorey (p = 0.002) at 40; terminal Node 2 (n = 24) and Node 3 (n = 62) shown as histograms of counts.]

15 Examples: Glaucoma & Laser Scanning Images Laser scanning images taken from the eye background are expected to serve as the basis of an automated system for glaucoma diagnosis. For 98 patients and 98 controls, matched by age and gender, 62 covariates describing the eye morphology are available.
R> data("GlaucomaM", package = "ipred")
R> gtree <- ctree(Class ~ ., data = GlaucomaM)
R> plot(gtree)

16 [Tree plot: inner splits in vari (p < 0.001, at 0.059), vasg (p < 0.001, at 0.046) and vart (p = 0.001, at 0.005); terminal Node 4 (n = 51), Node 5 (n = 22), Node 6 (n = 14) and Node 7 (n = 109), each shown with its glaucoma/normal class distribution.]

17 Examples: Glaucoma & Laser Scanning Images Interested in the class distribution in each inner node? Want to explore the split statistics in each inner node?
R> plot(gtree, inner_panel = node_barplot)

18 [Tree plot with barplots in the inner nodes: Node 1 (n = 196, vari, p < 0.001), Node 2 (n = 187, vasg, p < 0.001) and Node 3 (n = 173, vart, p = 0.001), each showing the glaucoma/normal distribution; terminal nodes as in the previous plot.]

19 Examples: Glaucoma & Laser Scanning Images
R> cex <- 1.6
R> inner <- nodes(gtree, 1:3)
R> layout(matrix(1:length(inner), ncol = length(inner)))
R> out <- sapply(inner, function(i) {
+      splitstat <- i$psplit$splitstatistic
+      x <- GlaucomaM[[i$psplit$variableName]][splitstat > 0]
+      plot(x, splitstat[splitstat > 0], main = paste("Node", i$nodeID),
+           xlab = i$psplit$variableName, ylab = "Statistic", ylim = c(0, 1),
+           cex.axis = cex, cex.lab = cex, cex.main = cex)
+      abline(v = i$psplit$splitpoint, lty = 3)
+ })

20 [Three scatter plots of the standardized split statistics against vari, vasg and vart in Nodes 1-3, with each selected split point marked by a dotted vertical line.]

21 Examples: Node Positive Breast Cancer Evaluation of prognostic factors for the German Breast Cancer Study Group (GBSG2) data, a prospective controlled clinical trial on the treatment of node positive breast cancer patients. Complete data on seven prognostic factors for 686 women are used for prognostic modeling.
R> data("GBSG2", package = "ipred")
R> stree <- ctree(Surv(time, cens) ~ ., data = GBSG2)
R> plot(stree)

22 [Tree plot: splits in pnodes (p < 0.001, at 3), horTh (p = 0.035) and progrec (p < 0.001, at 20); terminal Node 3 (n = 248), Node 4 (n = 128), Node 6 (n = 144) and Node 7 (n = 166), each shown with a Kaplan-Meier curve.]

23 Examples: Mammography Experience Ordinal response variables are common in investigations where the response is a subjective human interpretation. We use an example given by Hosmer and Lemeshow (2000), p. 264, studying the relationship between mammography experience (never, within a year, over one year) and opinions about mammography expressed in questionnaires answered by n = 412 women.
R> data("mammoexp", package = "party")
R> mtree <- ctree(ME ~ ., data = mammoexp, scores = list(ME = 1:3,
+      SYMPT = 1:4, DECT = 1:3))
R> plot(mtree)

24 [Tree plot: splits in SYMPT (p < 0.001, ≤ Agree vs. > Agree) and PB (p = 0.012, at 8); terminal Node 2 (n = 113), Node 4 (n = 208) and Node 5 (n = 91), each shown with the distribution over never / within a year / over a year.]

25 Benchmark Experiments Hypothesis 1: Conditional inference trees with the statistical stop criterion perform as well as an exhaustive search algorithm with pruning. Hypothesis 2: Conditional inference trees with the statistical stop criterion perform as well as parametric unbiased recursive partitioning (QUEST, GUIDE; Loh, 2002, is a starting point). Equivalence is measured by the ratio of misclassification or mean squared error, with an equivalence margin of ±10% and Fieller confidence intervals.
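For reference, a sketch of a Fieller confidence interval for the ratio of two mean performances, under the simplifying assumption of two independent samples of performance values (the function name, df choice and demo data are ours; a paired benchmark design would additionally need the covariance term):

## Fieller interval for theta = mean(a) / mean(b): the set of theta with
## (mean(a) - theta * mean(b))^2 <= t^2 * (var(a)/na + theta^2 * var(b)/nb).
fieller_ci <- function(a, b, conf = 0.95) {
  t2 <- qt(1 - (1 - conf) / 2, df = length(a) + length(b) - 2)^2
  ma <- mean(a); mb <- mean(b)
  va <- var(a) / length(a); vb <- var(b) / length(b)
  A <- mb^2 - t2 * vb
  B <- -2 * ma * mb
  C <- ma^2 - t2 * va
  disc <- B^2 - 4 * A * C
  if (A <= 0 || disc < 0) return(c(NA, NA))  # interval unbounded
  sort((-B + c(-1, 1) * sqrt(disc)) / (2 * A))
}

## equivalence within the +/-10% margin: the interval lies inside [0.9, 1.1]
set.seed(1)
ci <- fieller_ci(rnorm(25, 0.20, 0.02), rnorm(25, 0.21, 0.02))
all(ci > 0.9 & ci < 1.1)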

26 [Dot plot: ratios of prediction error (ctree vs. rpart) with Fieller confidence intervals for the benchmark problems Boston Housing, Breast Cancer, Diabetes, Glass, Glaucoma, Ionosphere, Ozone, Servo, Sonar, Soybean, Vehicle and Vowel.]

27 [Dot plot: ratios of prediction error (ctree vs. QUEST / GUIDE) with Fieller confidence intervals for the same twelve benchmark problems.]

28 Summary The separation of variable selection and split point estimation, first implemented in CHAID (Kass, 1980), is the basis for unbiased recursive partitioning with responses and covariates measured at arbitrary scales. The statistical internal stop criterion ensures that interpretations drawn from such trees are valid in a statistical sense, i.e., with appropriate control of type I errors. Even though the algorithm has no concept of prediction error, its performance is at least equivalent to that of established procedures. We are committed to reproducible research, see
R> vignette(package = "party")

29 References
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth, California, 1984.
David W. Hosmer and Stanley Lemeshow. Applied Logistic Regression. John Wiley & Sons, New York, 2nd edition, 2000.
G. V. Kass. An exploratory technique for investigating large quantities of categorical data. Applied Statistics, 29(2):119-127, 1980.
Wei-Yin Loh. Regression trees with unbiased variable selection and interaction detection. Statistica Sinica, 12:361-386, 2002.
John Mingers. Expert systems: Rule induction with statistical data. Journal of the Operational Research Society, 38(1):39-47, 1987.
James N. Morgan and John A. Sonquist. Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58:415-434, 1963.
John R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo, California, 1993.
Helmut Strasser and Christian Weber. On the asymptotic theory of permutation statistics. Mathematical Methods of Statistics, 8:220-250, 1999.
Allan P. White and Wei Zhong Liu. Bias in information-based measures in decision tree induction. Machine Learning, 15:321-329, 1994.
