
Contents

9 Additive Models, Trees, and Related Methods
  9.1 Generalized Additive Models
      9.1.1 Fitting Additive Models
      9.1.2 Example: Additive Logistic Regression
      9.1.3 Summary
  9.2 Tree-Based Methods
      9.2.1 Background
      9.2.2 Regression Trees
      9.2.3 Classification Trees
      9.2.4 Other Issues
      9.2.5 Spam Example (Continued)
  9.3 PRIM: Bump Hunting
      9.3.1 Spam Example (Continued)
  9.4 MARS: Multivariate Adaptive Regression Splines
      9.4.1 Spam Example (Continued)
      9.4.2 Example (Simulated Data)
      9.4.3 Other Issues
  9.5 Hierarchical Mixtures of Experts
  9.6 Missing Data
  9.7 Computational Considerations
  Bibliographic Notes
  Exercises
  References

The Elements of Statistical Learning: Data Mining, Inference and Prediction

Chapter 9: Additive Models, Trees, and Related Methods

Jerome Friedman, Trevor Hastie, Robert Tibshirani

August 4, 2008

© Friedman, Hastie & Tibshirani

9 Additive Models, Trees, and Related Methods

In this chapter we begin our discussion of some specific methods for supervised learning. These techniques each assume a (different) structured form for the unknown regression function, and by doing so they finesse the curse of dimensionality. Of course, they pay the possible price of misspecifying the model, and so in each case there is a tradeoff that has to be made. They take off where Chapters 3-6 left off. We describe five related techniques: generalized additive models, trees, multivariate adaptive regression splines, the patient rule induction method, and hierarchical mixtures of experts.

9.1 Generalized Additive Models

Regression models play an important role in many data analyses, providing prediction and classification rules, and data analytic tools for understanding the importance of different inputs.

Although attractively simple, the traditional linear model often fails in these situations: in real life, effects are often not linear. In earlier chapters we described techniques that used predefined basis functions to achieve nonlinearities. This section describes more automatic flexible statistical methods that may be used to identify and characterize nonlinear regression effects. These methods are called "generalized additive models."

In the regression setting, a generalized additive model has the form

\[
\mathrm{E}(Y \mid X_1, X_2, \ldots, X_p) = \alpha + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p). \qquad (9.1)
\]

As usual X_1, X_2, ..., X_p represent predictors and Y is the outcome; the f_j's are unspecified smooth ("nonparametric") functions. If we were to model each function using an expansion of basis functions (as in Chapter 5), the resulting model could then be fit by simple least squares. Our approach here is different: we fit each function using a scatterplot smoother (e.g., a cubic smoothing spline or kernel smoother), and provide an algorithm for simultaneously estimating all p functions (Section 9.1.1).

For two-class classification, recall the logistic regression model for binary data discussed in Section 4.4. We relate the mean of the binary response µ(X) = Pr(Y = 1 | X) to the predictors via a linear regression model and the logit link function:

\[
\log\frac{\mu(X)}{1-\mu(X)} = \alpha + \beta_1 X_1 + \cdots + \beta_p X_p. \qquad (9.2)
\]

The additive logistic regression model replaces each linear term by a more general functional form

\[
\log\frac{\mu(X)}{1-\mu(X)} = \alpha + f_1(X_1) + \cdots + f_p(X_p), \qquad (9.3)
\]

where again each f_j is an unspecified smooth function. While the nonparametric form for the functions f_j makes the model more flexible, the additivity is retained and allows us to interpret the model in much the same way as before. The additive logistic regression model is an example of a generalized additive model. In general, the conditional mean µ(X) of a response Y is related to an additive function of the predictors via a link function g:

\[
g[\mu(X)] = \alpha + f_1(X_1) + \cdots + f_p(X_p). \qquad (9.4)
\]

Examples of classical link functions are the following:

- g(µ) = µ is the identity link, used for linear and additive models for Gaussian response data.
- g(µ) = logit(µ) as above, or g(µ) = probit(µ), the probit link function, for modeling binomial probabilities. The probit function is the inverse Gaussian cumulative distribution function: probit(µ) = Φ^{-1}(µ).
- g(µ) = log(µ) for log-linear or log-additive models for Poisson count data.

All three of these arise from exponential family sampling models, which in addition include the gamma and negative-binomial distributions. These families generate the well-known class of generalized linear models, which are all extended in the same way to generalized additive models.

The functions f_j are estimated in a flexible manner, using an algorithm whose basic building block is a scatterplot smoother. The estimated function f̂_j can then reveal possible nonlinearities in the effect of X_j.
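To make the three classical links above concrete, here is a minimal Python sketch (not from the original text) that evaluates each link; scipy.stats.norm supplies Φ^{-1} for the probit:

    import numpy as np
    from scipy.stats import norm

    def identity_link(mu):    # Gaussian responses
        return mu

    def logit_link(mu):       # binomial probabilities
        return np.log(mu / (1.0 - mu))

    def probit_link(mu):      # binomial probabilities, via Phi^{-1}
        return norm.ppf(mu)

    def log_link(mu):         # Poisson counts
        return np.log(mu)

    mu = np.array([0.1, 0.5, 0.9])
    print(logit_link(mu))     # [-2.197  0.     2.197]
    print(probit_link(mu))    # [-1.282  0.     1.282]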

Not all of the functions f_j need to be nonlinear. We can easily mix in linear and other parametric forms with the nonlinear terms, a necessity when some of the inputs are qualitative variables (factors). The nonlinear terms are not restricted to main effects either; we can have nonlinear components in two or more variables, or separate curves in X_j for each level of the factor X_k. Thus each of the following would qualify:

- g(µ) = X^T β + α_k + f(Z): a semiparametric model, where X is a vector of predictors to be modeled linearly, α_k the effect for the kth level of a qualitative input V, and the effect of predictor Z is modeled nonparametrically.
- g(µ) = f(X) + g_k(Z): again k indexes the levels of a qualitative input V, and thus creates an interaction term g(V, Z) = g_k(Z) for the effect of V and Z.
- g(µ) = f(X) + g(Z, W), where g is a nonparametric function in two features.

Additive models can replace linear models in a wide variety of settings, for example an additive decomposition of time series,

\[
Y_t = S_t + T_t + \varepsilon_t, \qquad (9.5)
\]

where S_t is a seasonal component, T_t is a trend and ε is an error term.

9.1.1 Fitting Additive Models

In this section we describe a modular algorithm for fitting additive models and their generalizations. The building block is the scatterplot smoother for fitting nonlinear effects in a flexible way. For concreteness we use as our scatterplot smoother the cubic smoothing spline described in Chapter 5.

The additive model has the form

\[
Y = \alpha + \sum_{j=1}^{p} f_j(X_j) + \varepsilon, \qquad (9.6)
\]

where the error term ε has mean zero. Given observations x_i, y_i, a criterion like the penalized sum of squares (5.9) of Section 5.4 can be specified for this problem,

\[
\mathrm{PRSS}(\alpha, f_1, f_2, \ldots, f_p) = \sum_{i=1}^{N}\Big(y_i - \alpha - \sum_{j=1}^{p} f_j(x_{ij})\Big)^{2} + \sum_{j=1}^{p} \lambda_j \int f_j''(t_j)^2 \, dt_j, \qquad (9.7)
\]

where the λ_j ≥ 0 are tuning parameters. It can be shown that the minimizer of (9.7) is an additive cubic spline model; each of the functions f_j is a cubic spline in the component X_j, with knots at each of the unique values of x_ij, i = 1, ..., N.

Algorithm 9.1 The Backfitting Algorithm for Additive Models.

1. Initialize: α̂ = (1/N) Σ_{i=1}^{N} y_i, f̂_j ≡ 0, ∀j.
2. Cycle: j = 1, 2, ..., p, ..., 1, 2, ..., p, ...,

   \[
   \hat{f}_j \leftarrow \mathcal{S}_j\Big[\big\{y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f}_k(x_{ik})\big\}_{1}^{N}\Big],
   \qquad
   \hat{f}_j \leftarrow \hat{f}_j - \frac{1}{N} \sum_{i=1}^{N} \hat{f}_j(x_{ij}),
   \]

   until the functions f̂_j change less than a prespecified threshold.

However, without further restrictions on the model, the solution is not unique. The constant α is not identifiable, since we can add or subtract any constants to each of the functions f_j, and adjust α accordingly. The standard convention is to assume that (1/N) Σ_i f_j(x_ij) = 0 for all j; the functions average zero over the data. It is easily seen that α̂ = ave(y_i) in this case. If in addition to this restriction, the matrix of input values (having ijth entry x_ij) is nonsingular, then (9.7) is a strictly convex criterion and the minimizer is unique. If the matrix is singular, then the linear part of the components f_j cannot be uniquely determined (while the nonlinear parts can!) (Buja et al. 1989).

Furthermore, a simple iterative procedure exists for finding the solution. We set α̂ = ave(y_i), and it never changes. We apply a cubic smoothing spline S_j to the targets {y_i − α̂ − Σ_{k≠j} f̂_k(x_ik)}_1^N, as a function of x_ij, to obtain a new estimate f̂_j. This is done for each predictor in turn, using the current estimates of the other functions f̂_k when computing y_i − α̂ − Σ_{k≠j} f̂_k(x_ik). The process is continued until the estimates f̂_j stabilize. This procedure, given in detail in Algorithm 9.1, is known as "backfitting" and the resulting fit is analogous to a multiple regression for linear models.

In principle, the second step in (2) of Algorithm 9.1 is not needed, since the smoothing spline fit to a mean-zero response has mean zero (Exercise 9.1). In practice, machine rounding can cause slippage, and the adjustment is advised.
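The following is a minimal Python sketch of Algorithm 9.1, with one substitution hedged up front: SciPy's UnivariateSpline (with its default smoothing) stands in for the penalized cubic smoothing spline S_j of the text, and each predictor is assumed to have distinct values:

    import numpy as np
    from scipy.interpolate import UnivariateSpline

    def backfit(X, y, n_cycles=20):
        N, p = X.shape
        alpha = y.mean()                 # step 1: alpha-hat = ave(y_i)
        f_hat = np.zeros((N, p))         # f_j-hat evaluated at the data
        for _ in range(n_cycles):        # step 2: cycle over j = 1, ..., p
            for j in range(p):
                # partial residual: y minus alpha and all other components
                r = y - alpha - (f_hat.sum(axis=1) - f_hat[:, j])
                order = np.argsort(X[:, j])
                smooth = UnivariateSpline(X[order, j], r[order])
                f_hat[:, j] = smooth(X[:, j])
                f_hat[:, j] -= f_hat[:, j].mean()  # re-center to average zero
        return alpha, f_hat

The re-centering line is the second step in (2) of Algorithm 9.1, guarding against the rounding slippage mentioned above; a production version would iterate until the change in each f̂_j falls below a threshold rather than for a fixed number of cycles.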

This same algorithm can accommodate other fitting methods in exactly the same way, by specifying appropriate smoothing operators S_j:

- other univariate regression smoothers such as local polynomial regression and kernel methods;
- linear regression operators yielding polynomial fits, piecewise constant fits, parametric spline fits, series and Fourier fits;
- more complicated operators such as surface smoothers for second or higher-order interactions or periodic smoothers for seasonal effects.

If we consider the operation of smoother S_j only at the training points, it can be represented by an N × N operator matrix S_j (see Section 5.4.1). Then the degrees of freedom for the jth term are (approximately) computed as df_j = trace[S_j] − 1, by analogy with degrees of freedom for smoothers discussed in Chapters 5 and 6.

For a large class of linear smoothers S_j, backfitting is equivalent to a Gauss-Seidel algorithm for solving a certain linear system of equations. Details are given in Exercise 9.2.

For the logistic regression model and other generalized additive models, the appropriate criterion is a penalized log-likelihood. To maximize it, the backfitting procedure is used in conjunction with a likelihood maximizer. The usual Newton-Raphson routine for maximizing log-likelihoods in generalized linear models can be recast as an IRLS (iteratively reweighted least squares) algorithm. This involves repeatedly fitting a weighted linear regression of a working response variable on the covariates; each regression yields a new value of the parameter estimates, which in turn give new working responses and weights, and the process is iterated (see Section 4.4.1). In the generalized additive model, the weighted linear regression is simply replaced by a weighted backfitting algorithm. We describe the algorithm in more detail for logistic regression below, and more generally in Chapter 6 of Hastie and Tibshirani (1990).

9.1.2 Example: Additive Logistic Regression

Probably the most widely used model in medical research is the logistic model for binary data. In this model the outcome Y can be coded as 0 or 1, with 1 indicating an event (like death or relapse of a disease) and 0 indicating no event. We wish to model Pr(Y = 1 | X), the probability of an event given values of the prognostic factors X = (X_1, ..., X_p). The goal is usually to understand the roles of the prognostic factors, rather than to classify new individuals. Logistic models are also used in applications where one is interested in estimating the class probabilities, for use in risk screening. Apart from medical applications, credit risk screening is a popular application.

The generalized additive logistic model has the form

\[
\log\frac{\Pr(Y=1 \mid X)}{\Pr(Y=0 \mid X)} = \alpha + f_1(X_1) + \cdots + f_p(X_p). \qquad (9.8)
\]

The functions f_1, f_2, ..., f_p are estimated by a backfitting algorithm within a Newton-Raphson procedure, shown in Algorithm 9.2.

Algorithm 9.2 Local Scoring Algorithm for the Additive Logistic Regression Model.

1. Compute starting values: α̂ = log[ȳ/(1 − ȳ)], where ȳ = ave(y_i), the sample proportion of ones, and set f̂_j ≡ 0 ∀j.
2. Define η̂_i = α̂ + Σ_j f̂_j(x_ij) and p̂_i = 1/[1 + exp(−η̂_i)]. Iterate:

   (a) Construct the working target variable

       \[
       z_i = \hat{\eta}_i + \frac{y_i - \hat{p}_i}{\hat{p}_i(1 - \hat{p}_i)}.
       \]

   (b) Construct weights w_i = p̂_i(1 − p̂_i).
   (c) Fit an additive model to the targets z_i with weights w_i, using a weighted backfitting algorithm. This gives new estimates α̂, f̂_j, ∀j.

3. Continue step 2 until the change in the functions falls below a prespecified threshold.

The additive model fitting in step (2) of Algorithm 9.2 requires a weighted scatterplot smoother. Most smoothing procedures can accept observation weights (Exercise 5.12); see Chapter 3 of Hastie and Tibshirani (1990) for further details.

The additive logistic regression model can be generalized further to handle more than two classes, using the multilogit formulation as outlined in Section 4.4. While the formulation is a straightforward extension of (9.8), the algorithms for fitting such models are more complex. See Yee and Wild (1996) for details, and the VGAM software currently available from: yee.
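A minimal Python sketch of Algorithm 9.2, assuming a weighted_backfit(X, z, w) helper with the same structure as the backfitting sketch above but carrying observation weights into the smoother:

    import numpy as np

    def local_scoring(X, y, weighted_backfit, n_iter=10):
        N, p = X.shape
        ybar = y.mean()
        alpha = np.log(ybar / (1.0 - ybar))     # step 1: starting values
        f_hat = np.zeros((N, p))
        for _ in range(n_iter):                 # step 2: Newton-Raphson steps
            eta = alpha + f_hat.sum(axis=1)
            p_hat = 1.0 / (1.0 + np.exp(-eta))
            z = eta + (y - p_hat) / (p_hat * (1.0 - p_hat))  # (a) working target
            w = p_hat * (1.0 - p_hat)                        # (b) weights
            alpha, f_hat = weighted_backfit(X, z, w)         # (c) weighted fit
        return alpha, f_hat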

Example: Predicting Spam

We apply a generalized additive model to the spam data introduced in Chapter 1. The data consists of information from 4601 email messages, in a study to screen email for spam (i.e., junk email). The data is publicly available at ftp.ics.uci.edu, and was donated by George Forman from Hewlett-Packard laboratories, Palo Alto, California.

The response variable is binary, with values email or spam, and there are 57 predictors as described below:

- 48 quantitative predictors: the percentage of words in the email that match a given word. Examples include business, address, internet, free, and george. The idea was that these could be customized for individual users.
- 6 quantitative predictors: the percentage of characters in the email that match a given character. The characters are ch;, ch(, ch[, ch!, ch$, and ch#.
- The average length of uninterrupted sequences of capital letters: CAPAVE.
- The length of the longest uninterrupted sequence of capital letters: CAPMAX.
- The sum of the length of uninterrupted sequences of capital letters: CAPTOT.

We coded spam as 1 and email as zero. A test set of size 1536 was randomly chosen, leaving 3065 observations in the training set. A generalized additive model was fit, using a cubic smoothing spline with a nominal four degrees of freedom for each predictor. What this means is that for each predictor X_j, the smoothing-spline parameter λ_j was chosen so that trace[S_j(λ_j)] − 1 = 4, where S_j(λ) is the smoothing spline operator matrix constructed using the observed values x_ij, i = 1, ..., N. This is a convenient way of specifying the amount of smoothing in such a complex model.
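The calibration trace[S_j(λ_j)] − 1 = 4 can be sketched as a one-dimensional root-finding problem. In the toy Python version below, a discrete second-difference penalty (a Whittaker-type smoother standing in for the cubic smoothing spline of the text) gives S_λ = (I + λK)^{-1}, whose trace follows from the eigenvalues of K:

    import numpy as np
    from scipy.optimize import brentq

    def trace_of_lambda(lam, eigvals):
        # trace[(I + lam*K)^{-1}] = sum_i 1 / (1 + lam * d_i)
        return np.sum(1.0 / (1.0 + lam * eigvals))

    def calibrate_lambda(n, target_trace=5.0):
        D = np.diff(np.eye(n), n=2, axis=0)   # second-difference operator
        K = D.T @ D                           # penalty matrix
        d = np.linalg.eigvalsh(K)
        # the trace is monotone decreasing in lambda; solve by bracketing
        return brentq(lambda lam: trace_of_lambda(lam, d) - target_trace,
                      1e-10, 1e10)

    lam = calibrate_lambda(n=100)   # trace[S] = 5, i.e. df = 4 plus the constant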

Most of the spam predictors have a very long-tailed distribution. Before fitting the GAM model, we log-transformed each variable (actually log(x + 0.1)), but the plots in Figure 9.1 are shown as a function of the original variables.

The test error rates are shown in Table 9.1; the overall error rate is 5.5%. By comparison, a linear logistic regression has a test error rate of 7.6%.

TABLE 9.1. Test data confusion matrix for the additive logistic regression model fit to the spam training data. The overall test error rate is 5.5%.

                         Predicted Class
    True Class        email (0)    spam (1)
    email (0)           58.3%        2.5%
    spam (1)             3.0%       36.3%

Table 9.2 shows the predictors that are highly significant in the additive model. For ease of interpretation, in Table 9.2 the contribution for each variable is decomposed into a linear component and the remaining nonlinear component. The top block of predictors are positively correlated with spam, while the bottom block is negatively correlated. The linear component is a weighted least squares linear fit of the fitted curve on the predictor, while the nonlinear part is the residual. The linear component of an estimated function is summarized by the coefficient, standard error and Z-score; the latter is the coefficient divided by its standard error, and is considered significant if it exceeds the appropriate quantile of a standard normal distribution. The column labeled nonlinear P-value is a test of nonlinearity of the estimated function. Note, however, that the effect of each predictor is fully adjusted for the entire effects of the other predictors, not just for their linear parts. The predictors shown in the table were judged significant by at least one of the tests (linear or nonlinear) at the p = 0.01 level (two-sided).

TABLE 9.2. Significant predictors from the additive model fit to the spam training data. The coefficients represent the linear part of f̂_j, along with their standard errors and Z-scores; the nonlinear P-value is for a test of nonlinearity of f̂_j.

    Positive effects: our, over, remove, internet, free, business, hpl, ch!, ch$, CAPMAX, CAPTOT
    Negative effects: hp, george, 1999, re, edu

Figure 9.1 shows the estimated functions for the significant predictors appearing in Table 9.2. Many of the nonlinear effects appear to account for a strong discontinuity at zero. For example, the probability of spam drops significantly as the frequency of george increases from zero, but then does not change much after that. This suggests that one might replace each of the frequency predictors by an indicator variable for a zero count, and resort to a linear logistic model. This gave a test error rate of 7.4%; including the linear effects of the frequencies as well dropped the test error to 6.6%. It appears that the nonlinearities in the additive model have additional predictive power.

FIGURE 9.1. Spam analysis: estimated functions for significant predictors. The rug plot along the bottom of each frame indicates the observed values of the corresponding predictor. For many of the predictors the nonlinearity picks up the discontinuity at zero. Panels show f̂ for: our, over, remove, internet, free, business, hp, hpl, george, 1999, re, edu, ch!, ch$, CAPMAX, CAPTOT.

It is more serious to classify a genuine email message as spam, since then a good email would be filtered out and would not reach the user. We can alter the balance between the class error rates by changing the losses (see Section 2.4). If we assign a loss L_01 for predicting a true class 0 as class 1, and L_10 for predicting a true class 1 as class 0, then the estimated Bayes rule predicts class 1 if its probability is greater than L_01/(L_01 + L_10). For example, if we take L_01 = 10, L_10 = 1 then the (true) class 0 and class 1 error rates change to 0.8% and 8.7%.

More ambitiously, we can encourage the model to fit the data in class 0 better by using weights L_01 for the class 0 observations and L_10 for the class 1 observations. As above, we then use the estimated Bayes rule to predict. This gave error rates of 1.2% and 8.0% in (true) class 0 and class 1, respectively. We discuss the issue of unequal losses further below, in the context of tree-based models.
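A minimal Python sketch of this thresholding rule; p_hat would be the fitted probabilities from the additive logistic model:

    import numpy as np

    def bayes_rule(p_hat, L01=10.0, L10=1.0):
        # predict class 1 when Pr(Y=1|x) exceeds L01 / (L01 + L10)
        return (p_hat > L01 / (L01 + L10)).astype(int)

    def class_error_rates(y_true, y_pred):
        err0 = np.mean(y_pred[y_true == 0] == 1)   # true class 0 error rate
        err1 = np.mean(y_pred[y_true == 1] == 0)   # true class 1 error rate
        return err0, err1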

After fitting an additive model, one should check whether the inclusion of some interactions can significantly improve the fit. This can be done manually, by inserting products of some or all of the significant inputs, or automatically via the MARS procedure (Section 9.4).

This example uses the additive model in an automatic fashion. As a data analysis tool, additive models are often used in a more interactive fashion, adding and dropping terms to determine their effect. By calibrating the amount of smoothing in terms of df_j, one can move seamlessly between linear models (df_j = 1) and partially linear models, where some terms are modeled more flexibly. See Hastie and Tibshirani (1990) for more details.

9.1.3 Summary

Additive models provide a useful extension of linear models, making them more flexible while still retaining much of their interpretability. The familiar tools for modelling and inference in linear models are also available for additive models, seen for example in Table 9.2. The backfitting procedure for fitting these models is simple and modular, allowing one to choose a fitting method appropriate for each input variable. As a result they have become widely used in the statistical community.

However additive models can have limitations for large data-mining applications. The backfitting algorithm fits all predictors, which is not feasible or desirable when a large number are available. The BRUTO procedure (Hastie and Tibshirani 1990, Chapter 9) combines backfitting with selection of inputs, but is not designed for large data-mining problems. There has also been recent work using lasso-type penalties to estimate sparse additive models, for example the COSSO procedure of Lin and Zhang (2003) and the SpAM proposal of Ravikumar et al. (2007). For large problems a forward stagewise approach such as boosting (Chapter 14) is more effective, and also allows for interactions to be included in the model.

9.2 Tree-Based Methods

9.2.1 Background

Tree-based methods partition the feature space into a set of rectangles, and then fit a simple model (like a constant) in each one. They are conceptually simple yet powerful. We first describe a popular method for tree-based regression and classification called CART, and later contrast it with C4.5, a major competitor.

Let's consider a regression problem with continuous response Y and inputs X_1 and X_2, each taking values in the unit interval. The top left panel of Figure 9.2 shows a partition of the feature space by lines that are parallel to the coordinate axes. In each partition element we can model Y with a different constant. However, there is a problem: although each partitioning line has a simple description like X_1 = c, some of the resulting regions are complicated to describe.

To simplify matters, we restrict attention to recursive binary partitions like that in the top right panel of Figure 9.2. We first split the space into two regions, and model the response by the mean of Y in each region. We choose the variable and split-point to achieve the best fit. Then one or both of these regions are split into two more regions, and this process is continued, until some stopping rule is applied. For example, in the top right panel of Figure 9.2, we first split at X_1 = t_1. Then the region X_1 ≤ t_1 is split at X_2 = t_2 and the region X_1 > t_1 is split at X_1 = t_3. Finally, the region X_1 > t_3 is split at X_2 = t_4. The result of this process is a partition into the five regions R_1, R_2, ..., R_5 shown in the figure. The corresponding regression model predicts Y with a constant c_m in region R_m, that is,

\[
\hat{f}(X) = \sum_{m=1}^{5} c_m I\{(X_1, X_2) \in R_m\}. \qquad (9.9)
\]

This same model can be represented by the binary tree in the bottom left panel of Figure 9.2. The full dataset sits at the top of the tree. Observations satisfying the condition at each junction are assigned to the left branch, and the others to the right branch. The terminal nodes or leaves of the tree correspond to the regions R_1, R_2, ..., R_5. The bottom right panel of Figure 9.2 is a perspective plot of the regression surface from this model. For illustration, we chose the node means c_1 = 5, c_2 = 7, c_3 = 0, c_4 = 2, c_5 = 4 to make this plot.

A key advantage of the recursive binary tree is its interpretability. The feature space partition is fully described by a single tree. With more than two inputs, partitions like that in the top right panel of Figure 9.2 are difficult to draw, but the binary tree representation works in the same way. This representation is also popular among medical scientists, perhaps because it mimics the way that a doctor thinks. The tree stratifies the population into strata of high and low outcome, on the basis of patient characteristics.

FIGURE 9.2. Partitions and CART. Top right panel shows a partition of a two-dimensional feature space by recursive binary splitting, as used in CART, applied to some fake data. Top left panel shows a general partition that cannot be obtained from recursive binary splitting. Bottom left panel shows the tree corresponding to the partition in the top right panel, and a perspective plot of the prediction surface appears in the bottom right panel.
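To make the tree-region correspondence concrete, the bottom left tree of Figure 9.2 can be written as nested conditionals. The split points t_1, ..., t_4 below are hypothetical (the figure leaves them unspecified); the node means are the ones quoted in the text:

    t1, t2, t3, t4 = 0.4, 0.5, 0.7, 0.6   # hypothetical split points

    def f_hat(x1, x2):
        # the model (9.9): a constant c_m in each region R_m
        if x1 <= t1:
            return 5.0 if x2 <= t2 else 7.0   # R1 (c1 = 5), R2 (c2 = 7)
        if x1 <= t3:
            return 0.0                        # R3 (c3 = 0)
        return 2.0 if x2 <= t4 else 4.0       # R4 (c4 = 2), R5 (c5 = 4)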

9.2.2 Regression Trees

We now turn to the question of how to grow a regression tree. Our data consists of p inputs and a response, for each of N observations: that is, (x_i, y_i) for i = 1, 2, ..., N, with x_i = (x_i1, x_i2, ..., x_ip). The algorithm needs to automatically decide on the splitting variables and split points, and also what topology (shape) the tree should have. Suppose first that we have a partition into M regions R_1, R_2, ..., R_M, and we model the response as a constant c_m in each region:

\[
f(x) = \sum_{m=1}^{M} c_m I(x \in R_m). \qquad (9.10)
\]

If we adopt as our criterion minimization of the sum of squares Σ(y_i − f(x_i))^2, it is easy to see that the best ĉ_m is just the average of y_i in region R_m:

\[
\hat{c}_m = \mathrm{ave}(y_i \mid x_i \in R_m). \qquad (9.11)
\]

Now finding the best binary partition in terms of minimum sum of squares is generally computationally infeasible. Hence we proceed with a greedy algorithm. Starting with all of the data, consider a splitting variable j and split point s, and define the pair of half-planes

\[
R_1(j, s) = \{X \mid X_j \leq s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j > s\}. \qquad (9.12)
\]

Then we seek the splitting variable j and split point s that solve

\[
\min_{j,\, s}\Big[\min_{c_1} \sum_{x_i \in R_1(j,s)} (y_i - c_1)^2 + \min_{c_2} \sum_{x_i \in R_2(j,s)} (y_i - c_2)^2\Big]. \qquad (9.13)
\]

For any choice j and s, the inner minimization is solved by

\[
\hat{c}_1 = \mathrm{ave}(y_i \mid x_i \in R_1(j, s)) \quad \text{and} \quad \hat{c}_2 = \mathrm{ave}(y_i \mid x_i \in R_2(j, s)). \qquad (9.14)
\]

For each splitting variable, the determination of the split point s can be done very quickly and hence by scanning through all of the inputs, determination of the best pair (j, s) is feasible.

Having found the best split, we partition the data into the two resulting regions and repeat the splitting process on each of the two regions. Then this process is repeated on all of the resulting regions.
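A minimal (quadratic-time) Python sketch of the greedy search in (9.13)-(9.14); real implementations scan each variable in sorted order with running sums, which is what makes the split-point determination very quick:

    import numpy as np

    def best_split(X, y):
        best_j, best_s, best_rss = None, None, np.inf
        N, p = X.shape
        for j in range(p):
            for s in np.unique(X[:, j])[:-1]:   # candidate split points
                left = X[:, j] <= s             # R1(j, s); the rest is R2(j, s)
                rss = np.sum((y[left] - y[left].mean()) ** 2) + \
                      np.sum((y[~left] - y[~left].mean()) ** 2)
                if rss < best_rss:
                    best_j, best_s, best_rss = j, s, rss
        return best_j, best_s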

How large should we grow the tree? Clearly a very large tree might overfit the data, while a small tree might not capture the important structure. Tree size is a tuning parameter governing the model's complexity, and the optimal tree size should be adaptively chosen from the data. One approach would be to split tree nodes only if the decrease in sum-of-squares due to the split exceeds some threshold. This strategy is too short-sighted, however, since a seemingly worthless split might lead to a very good split below it.

The preferred strategy is to grow a large tree T_0, stopping the splitting process only when some minimum node size (say 5) is reached. Then this large tree is pruned using cost-complexity pruning, which we now describe.

We define a subtree T ⊂ T_0 to be any tree that can be obtained by pruning T_0, that is, collapsing any number of its internal (non-terminal) nodes. We index terminal nodes by m, with node m representing region R_m. Let |T| denote the number of terminal nodes in T. Letting

\[
N_m = \#\{x_i \in R_m\}, \qquad
\hat{c}_m = \frac{1}{N_m} \sum_{x_i \in R_m} y_i, \qquad
Q_m(T) = \frac{1}{N_m} \sum_{x_i \in R_m} (y_i - \hat{c}_m)^2, \qquad (9.15)
\]

we define the cost complexity criterion

\[
C_\alpha(T) = \sum_{m=1}^{|T|} N_m Q_m(T) + \alpha |T|. \qquad (9.16)
\]

The idea is to find, for each α, the subtree T_α ⊆ T_0 to minimize C_α(T). The tuning parameter α ≥ 0 governs the tradeoff between tree size and its goodness of fit to the data. Large values of α result in smaller trees T_α, and conversely for smaller values of α. As the notation suggests, with α = 0 the solution is the full tree T_0. We discuss how to adaptively choose α below.

For each α one can show that there is a unique smallest subtree T_α that minimizes C_α(T). To find T_α we use weakest link pruning: we successively collapse the internal node that produces the smallest per-node increase in Σ_m N_m Q_m(T), and continue until we produce the single-node (root) tree. This gives a (finite) sequence of subtrees, and one can show this sequence must contain T_α. See Breiman et al. (1984) or Ripley (1996) for details. Estimation of α is achieved by five- or tenfold cross-validation: we choose the value α̂ to minimize the cross-validated sum of squares. Our final tree is T_α̂.
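A hedged sketch of this procedure using scikit-learn, which implements the same weakest-link sequence through its ccp_alpha parameter; cross-validation then selects α̂ as described above, and min_samples_split=5 plays the role of the minimum node size:

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.model_selection import cross_val_score

    def prune_by_cv(X, y, cv=10):
        big_tree = DecisionTreeRegressor(min_samples_split=5)
        alphas = big_tree.cost_complexity_pruning_path(X, y).ccp_alphas
        scores = [cross_val_score(
                      DecisionTreeRegressor(min_samples_split=5, ccp_alpha=a),
                      X, y, cv=cv, scoring="neg_mean_squared_error").mean()
                  for a in alphas]
        alpha_hat = alphas[int(np.argmax(scores))]  # minimizes CV sum of squares
        return DecisionTreeRegressor(min_samples_split=5,
                                     ccp_alpha=alpha_hat).fit(X, y)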

9.2.3 Classification Trees

If the target is a classification outcome taking values 1, 2, ..., K, the only changes needed in the tree algorithm pertain to the criteria for splitting nodes and pruning the tree. For regression we used the squared-error node impurity measure Q_m(T) defined in (9.15), but this is not suitable for classification. In a node m, representing a region R_m with N_m observations, let

\[
\hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k),
\]

the proportion of class k observations in node m. We classify the observations in node m to class k(m) = arg max_k p̂_mk, the majority class in node m. Different measures Q_m(T) of node impurity include the following:

\[
\begin{aligned}
\text{Misclassification error:} &\quad \frac{1}{N_m} \sum_{i \in R_m} I(y_i \neq k(m)) = 1 - \hat{p}_{mk(m)}. \\
\text{Gini index:} &\quad \sum_{k \neq k'} \hat{p}_{mk}\, \hat{p}_{mk'} = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}). \\
\text{Cross-entropy or deviance:} &\quad -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}.
\end{aligned} \qquad (9.17)
\]

For two classes, if p is the proportion in the second class, these three measures are 1 − max(p, 1 − p), 2p(1 − p) and −p log p − (1 − p) log(1 − p), respectively. They are shown in Figure 9.3. All three are similar, but cross-entropy and the Gini index are differentiable, and hence more amenable to numerical optimization.

FIGURE 9.3. Node impurity measures for two-class classification, as a function of the proportion p in class 2. Cross-entropy has been scaled to pass through (0.5, 0.5).

Comparing (9.13) and (9.15), we see that we need to weight the node impurity measures by the number N_mL and N_mR of observations in the two child nodes created by splitting node m.
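The two-class forms of the three measures, as plotted in Figure 9.3, in a few lines of Python:

    import numpy as np

    def misclassification(p):
        return 1.0 - np.maximum(p, 1.0 - p)

    def gini(p):
        return 2.0 * p * (1.0 - p)

    def cross_entropy(p):
        p = np.clip(p, 1e-12, 1.0 - 1e-12)   # guard against log(0)
        return -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)

    p = np.linspace(0.0, 1.0, 101)
    # Figure 9.3 scales cross-entropy to pass through (0.5, 0.5):
    scaled = cross_entropy(p) * (0.5 / cross_entropy(0.5))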

In addition, cross-entropy and the Gini index are more sensitive to changes in the node probabilities than the misclassification rate. For example, in a two-class problem with 400 observations in each class (denote this by (400, 400)), suppose one split created nodes (300, 100) and (100, 300), while the other created nodes (200, 400) and (200, 0). Both splits produce a misclassification rate of 0.25, but the second split produces a pure node and is probably preferable. Both the Gini index and cross-entropy are lower for the second split. For this reason, either the Gini index or cross-entropy should be used when growing the tree. To guide cost-complexity pruning, any of the three measures can be used, but typically it is the misclassification rate.

The Gini index can be interpreted in two interesting ways. Rather than classify observations to the majority class in the node, we could classify them to class k with probability p̂_mk. Then the training error rate of this rule in the node is Σ_{k≠k'} p̂_mk p̂_mk', the Gini index. Similarly, if we code each observation as 1 for class k and zero otherwise, the variance over the node of this 0-1 response is p̂_mk(1 − p̂_mk). Summing over classes k again gives the Gini index.

9.2.4 Other Issues

Categorical Predictors

When splitting a predictor having q possible unordered values, there are 2^{q−1} − 1 possible partitions of the q values into two groups, and the computations become prohibitive for large q. However, with a 0-1 outcome, this computation simplifies. We order the predictor classes according to the proportion falling in outcome class 1. Then we split this predictor as if it were an ordered predictor. One can show this gives the optimal split, in terms of cross-entropy or Gini index, among all possible 2^{q−1} − 1 splits. This result also holds for a quantitative outcome and square error loss: the categories are ordered by increasing mean of the outcome. Although intuitive, the proofs of these assertions are not trivial. The proof for binary outcomes is given in Breiman et al. (1984) and Ripley (1996); the proof for quantitative outcomes can be found in ?. For multicategory outcomes, no such simplifications are possible, although various approximations have been proposed (Loh and Vanichsetakul 1988).

The partitioning algorithm tends to favor categorical predictors with many levels q; the number of partitions grows exponentially in q, and the more choices we have, the more likely we can find a good one for the data at hand. This can lead to severe overfitting if q is large, and such variables should be avoided.
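A minimal Python sketch of the ordering trick for a 0-1 outcome: sort the q levels by their proportion in class 1, then scan the q − 1 ordered splits instead of all 2^{q−1} − 1 subsets:

    import numpy as np

    def ordered_category_splits(levels, y):
        # levels: one category label per observation; y in {0, 1}
        cats = np.unique(levels)
        prop1 = np.array([y[levels == c].mean() for c in cats])
        ordered = cats[np.argsort(prop1)]
        # candidate left-groups: the first k categories in this order
        return [set(ordered[:k]) for k in range(1, len(ordered))]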

The Loss Matrix

In classification problems, the consequences of misclassifying observations are more serious in some classes than others. For example, it is probably worse to predict that a person will not have a heart attack when he/she actually will, than vice versa. To account for this, we define a K × K loss matrix L, with L_kk' being the loss incurred for classifying a class k observation as class k'. Typically no loss is incurred for correct classifications, that is, L_kk = 0 ∀k. To incorporate the losses into the modeling process, we could modify the Gini index to Σ_{k≠k'} L_kk' p̂_mk p̂_mk'; this would be the expected loss incurred by the randomized rule. This works for the multiclass case, but in the two-class case has no effect, since the coefficient of p̂_mk p̂_mk' is L_kk' + L_k'k. For two classes a better approach is to weight the observations in class k by L_kk'. This can be used in the multiclass case only if, as a function of k, L_kk' doesn't depend on k'. Observation weighting can be used with the deviance as well. The effect of observation weighting is to alter the prior probability on the classes. In a terminal node, the empirical Bayes rule implies that we classify to class k(m) = arg min_k Σ_l L_lk p̂_ml.

Missing Predictor Values

Suppose our data has some missing predictor values in some or all of the variables. We might discard any observation with some missing values, but this could lead to serious depletion of the training set. Alternatively we might try to fill in (impute) the missing values, with say the mean of that predictor over the nonmissing observations. For tree-based models, there are two better approaches. The first is applicable to categorical predictors: we simply make a new category for "missing." From this we might discover that observations with missing values for some measurement behave differently than those with nonmissing values. The second more general approach is the construction of surrogate variables. When considering a predictor for a split, we use only the observations for which that predictor is not missing. Having chosen the best (primary) predictor and split point, we form a list of surrogate predictors and split points. The first surrogate is the predictor and corresponding split point that best mimics the split of the training data achieved by the primary split. The second surrogate is the predictor and corresponding split point that does second best, and so on. When sending observations down the tree either in the training phase or during prediction, we use the surrogate splits in order, if the primary splitting predictor is missing. Surrogate splits exploit correlations between predictors to try and alleviate the effect of missing data. The higher the correlation between the missing predictor and the other predictors, the smaller the loss of information due to the missing value. The general problem of missing data is discussed in Section 9.6.

Why Binary Splits?

Rather than splitting each node into just two groups at each stage (as above), we might consider multiway splits into more than two groups. While this can sometimes be useful, it is not a good general strategy. The problem is that multiway splits fragment the data too quickly, leaving insufficient data at the next level down. Hence we would want to use such splits only when needed. Since multiway splits can be achieved by a series of binary splits, the latter are preferred.

Other Tree-Building Procedures

The discussion above focuses on the CART (classification and regression tree) implementation of trees. The other popular methodology is ID3 and its later versions, C4.5 and C5.0 (Quinlan 1993). Early versions of the program were limited to categorical predictors, and used a top-down rule with no pruning. With more recent developments, C5.0 has become quite similar to CART. The most significant feature unique to C5.0 is a scheme for deriving rule sets. After a tree is grown, the splitting rules that define the terminal nodes can sometimes be simplified: that is, one or more conditions can be dropped without changing the subset of observations that fall in the node. We end up with a simplified set of rules defining each terminal node; these no longer follow a tree structure, but their simplicity might make them more attractive to the user.

Linear Combination Splits

Rather than restricting splits to be of the form X_j ≤ s, one can allow splits along linear combinations of the form Σ a_j X_j ≤ s. The weights a_j and split point s are optimized to minimize the relevant criterion (such as the Gini index). While this can improve the predictive power of the tree, it can hurt interpretability. Computationally, the discreteness of the split point search precludes the use of a smooth optimization for the weights. A better way to incorporate linear combination splits is in the hierarchical mixtures of experts (HME) model, the topic of Section 9.5.

Instability of Trees

One major problem with trees is their high variance. Often a small change in the data can result in a very different series of splits, making interpretation somewhat precarious. The major reason for this instability is the hierarchical nature of the process: the effect of an error in the top split is propagated down to all of the splits below it. One can alleviate this to some degree by trying to use a more stable split criterion, but the inherent instability is not removed. It is the price to be paid for estimating a simple, tree-based structure from the data. Bagging (Section 8.7) averages many trees to reduce this variance.

Lack of Smoothness

Another limitation of trees is the lack of smoothness of the prediction surface, as can be seen in the bottom right panel of Figure 9.2. In classification with 0/1 loss, this doesn't hurt much, since bias in estimation of the class probabilities has a limited effect. However, this can degrade performance in the regression setting, where we would normally expect the underlying function to be smooth. The MARS procedure, described in Section 9.4, can be viewed as a modification of CART designed to alleviate this lack of smoothness.

Difficulty in Capturing Additive Structure

Another problem with trees is their difficulty in modeling additive structure. In regression, suppose, for example, that Y = c_1 I(X_1 < t_1) + c_2 I(X_2 < t_2) + ε where ε is zero-mean noise. Then a binary tree might make its first split on X_1 near t_1. At the next level down it would have to split both nodes on X_2 at t_2 in order to capture the additive structure. This might happen with sufficient data, but the model is given no special encouragement to find such structure. If there were ten rather than two additive effects, it would take many fortuitous splits to recreate the structure, and the data analyst would be hard pressed to recognize it in the estimated tree. The blame here can again be attributed to the binary tree structure, which has both advantages and drawbacks. Again the MARS method (Section 9.4) gives up this tree structure in order to capture additive structure.

9.2.5 Spam Example (Continued)

We applied the classification tree methodology to the spam example introduced earlier. We used the deviance measure to grow the tree and misclassification rate to prune it. Figure 9.4 shows the 10-fold cross-validation error rate as a function of the size of the pruned tree, along with ±2 standard errors of the mean, from the ten replications. The test error curve is shown in orange. Note that the cross-validation error rates are indexed by a sequence of values of α and not tree size; for trees grown in different folds, a value of α might imply different sizes. The sizes shown at the base of the plot refer to |T_α|, the sizes of the pruned original tree.

The error flattens out at around 17 terminal nodes, giving the pruned tree in Figure 9.5. Of the 13 distinct features chosen by the tree, 11 overlap with the 16 significant features in the additive model (Table 9.2). The overall error rate shown in Table 9.3 is about 50% higher than for the additive model in Table 9.1.

TABLE 9.3. Spam data: confusion rates for the 17-node tree (chosen by cross-validation) on the test data. Overall error rate is 9.3%.

                   Predicted
    True         email    spam
    email        57.3%    4.0%
    spam          5.3%   33.4%

Consider the rightmost branches of the tree. We branch to the right with a spam warning if more than 5.5% of the characters are the $ sign.

FIGURE 9.4. Results for spam example. The blue curve is the 10-fold cross-validation estimate of misclassification rate as a function of tree size, with ± two standard error bars. The minimum occurs at a tree size with about 17 terminal nodes. The orange curve is the test error, which tracks the CV error quite closely. The cross-validation was indexed by values of α, shown above. The tree sizes shown below refer to |T_α|, the size of the original tree indexed by α.

However, if in addition the phrase hp occurs frequently, then this is likely to be company business and we classify as email. All of the 22 cases in the test set satisfying these criteria were correctly classified. If the second condition is not met, and in addition the average length of repeated capital letters CAPAVE is larger than 2.9, then we classify as spam. Of the 227 test cases, only seven were misclassified.

In medical classification problems, the terms sensitivity and specificity are used to characterize a rule. They are defined as follows:

- Sensitivity: probability of predicting disease given true state is disease.
- Specificity: probability of predicting non-disease given true state is non-disease.

FIGURE 9.5. The pruned tree for the spam example. The split variables are shown in blue on the branches, and the classification is shown in every node. The numbers under the terminal nodes indicate misclassification rates on the test data.

If we think of spam and email as the presence and absence of disease, respectively, then from Table 9.3 we have

    Sensitivity = 33.4/(33.4 + 5.3) = 86.3%,
    Specificity = 57.3/(57.3 + 4.0) = 93.4%.

In this analysis we have used equal losses. As before let L_kk' be the loss associated with predicting a class k object as class k'. By varying the relative sizes of the losses L_01 and L_10, we increase the sensitivity and decrease the specificity of the rule, or vice versa. In this example, we want to avoid marking good email as spam, and thus we want the specificity to be very high. We can achieve this by setting L_01 > 1, say, with L_10 = 1. The Bayes rule in each terminal node classifies to class 1 (spam) if the proportion of spam is ≥ L_01/(L_10 + L_01), and class zero otherwise.
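A minimal Python sketch of these quantities, tracing an ROC curve by sweeping L_01 with L_10 = 1 as in the text; p_hat holds the per-node (or per-observation) spam proportions:

    import numpy as np

    def sens_spec(y_true, y_pred):
        sens = np.mean(y_pred[y_true == 1] == 1)   # predict spam given spam
        spec = np.mean(y_pred[y_true == 0] == 0)   # predict email given email
        return sens, spec

    def roc_points(y_true, p_hat, losses=np.linspace(0.1, 10.0, 50)):
        points = []
        for L01 in losses:
            threshold = L01 / (L01 + 1.0)          # Bayes rule with L10 = 1
            points.append(sens_spec(y_true, (p_hat >= threshold).astype(int)))
        return points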

The receiver operating characteristic curve (ROC) is a commonly used summary for assessing the tradeoff between sensitivity and specificity. It is a plot of the sensitivity versus specificity as we vary the parameters of a classification rule. Varying the loss L_01 between 0.1 and 10, and applying Bayes rule to the 17-node tree selected in Figure 9.4, produced the ROC curve shown in Figure 9.6.

FIGURE 9.6. ROC curves for the classification rules fit to the spam data. Curves that are closer to the northeast corner represent better classifiers. In this case the GAM classifier dominates the trees. The weighted tree achieves better sensitivity for higher specificity than the unweighted tree. The numbers in the legend represent the area under the curve: Tree (0.95), GAM (0.98), Weighted Tree (0.90).

The standard error of each curve near 0.9 is approximately √(0.9(1 − 0.9)/1536) = 0.008, and hence the standard error of the difference is about 0.01. We see that in order to achieve a specificity of close to 100%, the sensitivity has to drop to about 50%. The area under the curve is a commonly used quantitative summary; extending the curve linearly in each direction so that it is defined over [0, 100], the area is approximately 0.95. For comparison, we have included the ROC curve for the GAM model fit to these data in Section 9.2; it gives a better classification rule for any loss, with an area of 0.98.

Rather than just modifying the Bayes rule in the nodes, it is better to take full account of the unequal losses in growing the tree, as was done in Section 9.2. With just two classes 0 and 1, losses may be incorporated into the tree-growing process by using weight L_{k,1−k} for an observation in class k. Here we chose L_01 = 5, L_10 = 1 and fit the same size tree as before (|T_α| = 17). This tree has higher sensitivity at high values of the specificity than the original tree, but does more poorly at the other extreme. Its top few splits are the same as the original tree, and then it departs from it. For this application the tree grown using L_01 = 5 is clearly better than the original tree.

The area under the ROC curve, used above, is sometimes called the c-statistic. Interestingly, it can be shown that the area under the ROC curve is equivalent to the Mann-Whitney U statistic (or Wilcoxon rank-sum test), for the median difference between the prediction scores in the two groups (Hanley and McNeil 1982). For evaluating the contribution of an additional predictor when added to a standard model, the c-statistic may not be an informative measure. The new predictor can be very significant in terms of the change in model deviance, but show only a small increase in the c-statistic. For example, removal of the highly significant term george from the model of Table 9.2 results in a decrease in the c-statistic of less than 0.01. Instead, it is useful to examine how the additional predictor changes the classification on an individual sample basis. A good discussion of this point appears in Cook (2007).
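A minimal Python sketch of the c-statistic computed directly as a Mann-Whitney-type proportion: the fraction of (spam, email) pairs in which the spam message receives the higher prediction score, with ties counted half:

    import numpy as np

    def c_statistic(y_true, scores):
        s1 = scores[y_true == 1]    # prediction scores for class 1 (spam)
        s0 = scores[y_true == 0]    # prediction scores for class 0 (email)
        diffs = s1[:, None] - s0[None, :]
        return (diffs > 0).mean() + 0.5 * (diffs == 0).mean()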

9.3 PRIM: Bump Hunting

Tree-based methods (for regression) partition the feature space into box-shaped regions, to try to make the response averages in each box as different as possible. The splitting rules defining the boxes are related to each other through a binary tree, facilitating their interpretation.

The patient rule induction method (PRIM) also finds boxes in the feature space, but seeks boxes in which the response average is high. Hence it looks for maxima in the target function, an exercise known as bump hunting. (If minima rather than maxima are desired, one simply works with the negative response values.) PRIM also differs from tree-based partitioning methods in that the box definitions are not described by a binary tree. This makes interpretation of the collection of rules more difficult; however, by removing the binary tree constraint, the individual rules are often simpler.

The main box construction method in PRIM works from the top down, starting with a box containing all of the data. The box is compressed along one face by a small amount, and the observations then falling outside the box are peeled off. The face chosen for compression is the one resulting in the largest box mean, after the compression is performed. Then the process is repeated, stopping when the current box contains some minimum number of data points.

This process is illustrated in Figure 9.7. There are 200 data points uniformly distributed over the unit square. The color-coded plot indicates the response Y taking the value 1 (red) when 0.5 < X_1 < 0.8 and 0.4 < X_2 < 0.6, and zero (blue) otherwise. The panels show the successive boxes found by the top-down peeling procedure, peeling off a proportion α = 0.1 of the remaining data points at each stage. Figure 9.8 shows the mean of the response values in the box, as the box is compressed.

After the top-down sequence is computed, PRIM reverses the process, expanding along any edge, if such an expansion increases the box mean. This is called pasting. Since the top-down procedure is greedy at each step, such an expansion is often possible.

The result of these steps is a sequence of boxes, with different numbers of observations in each box. Cross-validation, combined with the judgment of the data analyst, is used to choose the optimal box size. Denote by B_1 the indices of the observations in the box found in step 1. The PRIM procedure then removes the observations in B_1 from the training set, and the two-step process (top-down peeling, followed by bottom-up pasting) is repeated on the remaining dataset. This entire process is repeated several times, producing a sequence of boxes B_1, B_2, ..., B_k. Each box is defined by a set of rules involving a subset of predictors like (a_1 ≤ X_1 ≤ b_1) and (b_1 ≤ X_3 ≤ b_2). A summary of the PRIM procedure is given in Algorithm 9.3.

PRIM can handle a categorical predictor by considering all partitions of the predictor, as in CART. Missing values are also handled in a manner similar to CART. PRIM is designed for regression (quantitative response variable); a two-class outcome can be handled simply by coding it as 0 and 1.
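A minimal Python sketch of the top-down peeling pass, assuming continuous predictors without ties; each iteration peels a proportion α off the lower or upper face that leaves the largest mean response in the box:

    import numpy as np

    def peel(X, y, alpha=0.1, min_obs=10):
        inside = np.ones(len(y), dtype=bool)
        box = [(X[:, j].min(), X[:, j].max()) for j in range(X.shape[1])]
        while inside.sum() > min_obs:
            best_mean, best_keep, best_face = -np.inf, None, None
            for j in range(X.shape[1]):
                lo = np.quantile(X[inside, j], alpha)        # peel lower face
                hi = np.quantile(X[inside, j], 1.0 - alpha)  # peel upper face
                for keep, face in [(X[:, j] >= lo, (j, 0, lo)),
                                   (X[:, j] <= hi, (j, 1, hi))]:
                    m = y[inside & keep].mean()
                    if m > best_mean:
                        best_mean, best_keep, best_face = m, inside & keep, face
            if best_keep.sum() == inside.sum():   # no progress; stop
                break
            j, side, value = best_face
            box[j] = (value, box[j][1]) if side == 0 else (box[j][0], value)
            inside = best_keep
        return box, inside

The bottom-up pasting pass would then try the reverse moves, expanding any face whose expansion increases the box mean.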

FIGURE 9.7. Illustration of PRIM algorithm. There are two classes, indicated by the blue (class 0) and red (class 1) points. The procedure starts with a rectangle (broken black lines) surrounding all of the data, and then peels away points along one edge by a prespecified amount in order to maximize the mean of the points remaining in the box. Starting at the top left panel, the sequence of peelings is shown, until a pure red region is isolated in the bottom right panel. The iteration number is indicated at the top of each panel.

FIGURE 9.8. Box mean as a function of number of observations in the box.


More information

Sequential Allocation with Minimal Switching

Sequential Allocation with Minimal Switching In Cmputing Science and Statistics 28 (1996), pp. 567 572 Sequential Allcatin with Minimal Switching Quentin F. Stut 1 Janis Hardwick 1 EECS Dept., University f Michigan Statistics Dept., Purdue University

More information

Differentiation Applications 1: Related Rates

Differentiation Applications 1: Related Rates Differentiatin Applicatins 1: Related Rates 151 Differentiatin Applicatins 1: Related Rates Mdel 1: Sliding Ladder 10 ladder y 10 ladder 10 ladder A 10 ft ladder is leaning against a wall when the bttm

More information

5 th grade Common Core Standards

5 th grade Common Core Standards 5 th grade Cmmn Cre Standards In Grade 5, instructinal time shuld fcus n three critical areas: (1) develping fluency with additin and subtractin f fractins, and develping understanding f the multiplicatin

More information

Support Vector Machines and Flexible Discriminants

Support Vector Machines and Flexible Discriminants Supprt Vectr Machines and Flexible Discriminants This is page Printer: Opaque this. Intrductin In this chapter we describe generalizatins f linear decisin bundaries fr classificatin. Optimal separating

More information

SPH3U1 Lesson 06 Kinematics

SPH3U1 Lesson 06 Kinematics PROJECTILE MOTION LEARNING GOALS Students will: Describe the mtin f an bject thrwn at arbitrary angles thrugh the air. Describe the hrizntal and vertical mtins f a prjectile. Slve prjectile mtin prblems.

More information

Module 4: General Formulation of Electric Circuit Theory

Module 4: General Formulation of Electric Circuit Theory Mdule 4: General Frmulatin f Electric Circuit Thery 4. General Frmulatin f Electric Circuit Thery All electrmagnetic phenmena are described at a fundamental level by Maxwell's equatins and the assciated

More information

Lead/Lag Compensator Frequency Domain Properties and Design Methods

Lead/Lag Compensator Frequency Domain Properties and Design Methods Lectures 6 and 7 Lead/Lag Cmpensatr Frequency Dmain Prperties and Design Methds Definitin Cnsider the cmpensatr (ie cntrller Fr, it is called a lag cmpensatr s K Fr s, it is called a lead cmpensatr Ntatin

More information

Midwest Big Data Summer School: Machine Learning I: Introduction. Kris De Brabanter

Midwest Big Data Summer School: Machine Learning I: Introduction. Kris De Brabanter Midwest Big Data Summer Schl: Machine Learning I: Intrductin Kris De Brabanter kbrabant@iastate.edu Iwa State University Department f Statistics Department f Cmputer Science June 24, 2016 1/24 Outline

More information

You need to be able to define the following terms and answer basic questions about them:

You need to be able to define the following terms and answer basic questions about them: CS440/ECE448 Sectin Q Fall 2017 Midterm Review Yu need t be able t define the fllwing terms and answer basic questins abut them: Intr t AI, agents and envirnments Pssible definitins f AI, prs and cns f

More information

Linear programming III

Linear programming III Linear prgramming III Review 1/33 What have cvered in previus tw classes LP prblem setup: linear bjective functin, linear cnstraints. exist extreme pint ptimal slutin. Simplex methd: g thrugh extreme pint

More information

Statistical Learning. 2.1 What Is Statistical Learning?

Statistical Learning. 2.1 What Is Statistical Learning? 2 Statistical Learning 2.1 What Is Statistical Learning? In rder t mtivate ur study f statistical learning, we begin with a simple example. Suppse that we are statistical cnsultants hired by a client t

More information

T Algorithmic methods for data mining. Slide set 6: dimensionality reduction

T Algorithmic methods for data mining. Slide set 6: dimensionality reduction T-61.5060 Algrithmic methds fr data mining Slide set 6: dimensinality reductin reading assignment LRU bk: 11.1 11.3 PCA tutrial in mycurses (ptinal) ptinal: An Elementary Prf f a Therem f Jhnsn and Lindenstrauss,

More information

NUMBERS, MATHEMATICS AND EQUATIONS

NUMBERS, MATHEMATICS AND EQUATIONS AUSTRALIAN CURRICULUM PHYSICS GETTING STARTED WITH PHYSICS NUMBERS, MATHEMATICS AND EQUATIONS An integral part t the understanding f ur physical wrld is the use f mathematical mdels which can be used t

More information

ENSC Discrete Time Systems. Project Outline. Semester

ENSC Discrete Time Systems. Project Outline. Semester ENSC 49 - iscrete Time Systems Prject Outline Semester 006-1. Objectives The gal f the prject is t design a channel fading simulatr. Upn successful cmpletin f the prject, yu will reinfrce yur understanding

More information

Linear Classification

Linear Classification Linear Classificatin CS 54: Machine Learning Slides adapted frm Lee Cper, Jydeep Ghsh, and Sham Kakade Review: Linear Regressin CS 54 [Spring 07] - H Regressin Given an input vectr x T = (x, x,, xp), we

More information

Modelling of Clock Behaviour. Don Percival. Applied Physics Laboratory University of Washington Seattle, Washington, USA

Modelling of Clock Behaviour. Don Percival. Applied Physics Laboratory University of Washington Seattle, Washington, USA Mdelling f Clck Behaviur Dn Percival Applied Physics Labratry University f Washingtn Seattle, Washingtn, USA verheads and paper fr talk available at http://faculty.washingtn.edu/dbp/talks.html 1 Overview

More information

Module 3: Gaussian Process Parameter Estimation, Prediction Uncertainty, and Diagnostics

Module 3: Gaussian Process Parameter Estimation, Prediction Uncertainty, and Diagnostics Mdule 3: Gaussian Prcess Parameter Estimatin, Predictin Uncertainty, and Diagnstics Jerme Sacks and William J Welch Natinal Institute f Statistical Sciences and University f British Clumbia Adapted frm

More information

CAUSAL INFERENCE. Technical Track Session I. Phillippe Leite. The World Bank

CAUSAL INFERENCE. Technical Track Session I. Phillippe Leite. The World Bank CAUSAL INFERENCE Technical Track Sessin I Phillippe Leite The Wrld Bank These slides were develped by Christel Vermeersch and mdified by Phillippe Leite fr the purpse f this wrkshp Plicy questins are causal

More information

Activity Guide Loops and Random Numbers

Activity Guide Loops and Random Numbers Unit 3 Lessn 7 Name(s) Perid Date Activity Guide Lps and Randm Numbers CS Cntent Lps are a relatively straightfrward idea in prgramming - yu want a certain chunk f cde t run repeatedly - but it takes a

More information

We say that y is a linear function of x if. Chapter 13: The Correlation Coefficient and the Regression Line

We say that y is a linear function of x if. Chapter 13: The Correlation Coefficient and the Regression Line Chapter 13: The Crrelatin Cefficient and the Regressin Line We begin with a sme useful facts abut straight lines. Recall the x, y crdinate system, as pictured belw. 3 2 1 y = 2.5 y = 0.5x 3 2 1 1 2 3 1

More information

PSU GISPOPSCI June 2011 Ordinary Least Squares & Spatial Linear Regression in GeoDa

PSU GISPOPSCI June 2011 Ordinary Least Squares & Spatial Linear Regression in GeoDa There are tw parts t this lab. The first is intended t demnstrate hw t request and interpret the spatial diagnstics f a standard OLS regressin mdel using GeDa. The diagnstics prvide infrmatin abut the

More information

Support-Vector Machines

Support-Vector Machines Supprt-Vectr Machines Intrductin Supprt vectr machine is a linear machine with sme very nice prperties. Haykin chapter 6. See Alpaydin chapter 13 fr similar cntent. Nte: Part f this lecture drew material

More information

Part 3 Introduction to statistical classification techniques

Part 3 Introduction to statistical classification techniques Part 3 Intrductin t statistical classificatin techniques Machine Learning, Part 3, March 07 Fabi Rli Preamble ØIn Part we have seen that if we knw: Psterir prbabilities P(ω i / ) Or the equivalent terms

More information

STATS216v Introduction to Statistical Learning Stanford University, Summer Practice Final (Solutions) Duration: 3 hours

STATS216v Introduction to Statistical Learning Stanford University, Summer Practice Final (Solutions) Duration: 3 hours STATS216v Intrductin t Statistical Learning Stanfrd University, Summer 2016 Practice Final (Slutins) Duratin: 3 hurs Instructins: (This is a practice final and will nt be graded.) Remember the university

More information

CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS

CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS CHAPTER 4 DIAGNOSTICS FOR INFLUENTIAL OBSERVATIONS 1 Influential bservatins are bservatins whse presence in the data can have a distrting effect n the parameter estimates and pssibly the entire analysis,

More information

How do scientists measure trees? What is DBH?

How do scientists measure trees? What is DBH? Hw d scientists measure trees? What is DBH? Purpse Students develp an understanding f tree size and hw scientists measure trees. Students bserve and measure tree ckies and explre the relatinship between

More information

The Kullback-Leibler Kernel as a Framework for Discriminant and Localized Representations for Visual Recognition

The Kullback-Leibler Kernel as a Framework for Discriminant and Localized Representations for Visual Recognition The Kullback-Leibler Kernel as a Framewrk fr Discriminant and Lcalized Representatins fr Visual Recgnitin Nun Vascncels Purdy H Pedr Mren ECE Department University f Califrnia, San Dieg HP Labs Cambridge

More information

AP Statistics Practice Test Unit Three Exploring Relationships Between Variables. Name Period Date

AP Statistics Practice Test Unit Three Exploring Relationships Between Variables. Name Period Date AP Statistics Practice Test Unit Three Explring Relatinships Between Variables Name Perid Date True r False: 1. Crrelatin and regressin require explanatry and respnse variables. 1. 2. Every least squares

More information

Enhancing Performance of MLP/RBF Neural Classifiers via an Multivariate Data Distribution Scheme

Enhancing Performance of MLP/RBF Neural Classifiers via an Multivariate Data Distribution Scheme Enhancing Perfrmance f / Neural Classifiers via an Multivariate Data Distributin Scheme Halis Altun, Gökhan Gelen Nigde University, Electrical and Electrnics Engineering Department Nigde, Turkey haltun@nigde.edu.tr

More information

Assessment Primer: Writing Instructional Objectives

Assessment Primer: Writing Instructional Objectives Assessment Primer: Writing Instructinal Objectives (Based n Preparing Instructinal Objectives by Mager 1962 and Preparing Instructinal Objectives: A critical tl in the develpment f effective instructin

More information

Pipetting 101 Developed by BSU CityLab

Pipetting 101 Developed by BSU CityLab Discver the Micrbes Within: The Wlbachia Prject Pipetting 101 Develped by BSU CityLab Clr Cmparisns Pipetting Exercise #1 STUDENT OBJECTIVES Students will be able t: Chse the crrect size micrpipette fr

More information

The general linear model and Statistical Parametric Mapping I: Introduction to the GLM

The general linear model and Statistical Parametric Mapping I: Introduction to the GLM The general linear mdel and Statistical Parametric Mapping I: Intrductin t the GLM Alexa Mrcm and Stefan Kiebel, Rik Hensn, Andrew Hlmes & J-B J Pline Overview Intrductin Essential cncepts Mdelling Design

More information

Subject description processes

Subject description processes Subject representatin 6.1.2. Subject descriptin prcesses Overview Fur majr prcesses r areas f practice fr representing subjects are classificatin, subject catalging, indexing, and abstracting. The prcesses

More information

THE LIFE OF AN OBJECT IT SYSTEMS

THE LIFE OF AN OBJECT IT SYSTEMS THE LIFE OF AN OBJECT IT SYSTEMS Persns, bjects, r cncepts frm the real wrld, which we mdel as bjects in the IT system, have "lives". Actually, they have tw lives; the riginal in the real wrld has a life,

More information

MATCHING TECHNIQUES. Technical Track Session VI. Emanuela Galasso. The World Bank

MATCHING TECHNIQUES. Technical Track Session VI. Emanuela Galasso. The World Bank MATCHING TECHNIQUES Technical Track Sessin VI Emanuela Galass The Wrld Bank These slides were develped by Christel Vermeersch and mdified by Emanuela Galass fr the purpse f this wrkshp When can we use

More information

the results to larger systems due to prop'erties of the projection algorithm. First, the number of hidden nodes must

the results to larger systems due to prop'erties of the projection algorithm. First, the number of hidden nodes must M.E. Aggune, M.J. Dambrg, M.A. El-Sharkawi, R.J. Marks II and L.E. Atlas, "Dynamic and static security assessment f pwer systems using artificial neural netwrks", Prceedings f the NSF Wrkshp n Applicatins

More information

Least Squares Optimal Filtering with Multirate Observations

Least Squares Optimal Filtering with Multirate Observations Prc. 36th Asilmar Cnf. n Signals, Systems, and Cmputers, Pacific Grve, CA, Nvember 2002 Least Squares Optimal Filtering with Multirate Observatins Charles W. herrien and Anthny H. Hawes Department f Electrical

More information

Distributions, spatial statistics and a Bayesian perspective

Distributions, spatial statistics and a Bayesian perspective Distributins, spatial statistics and a Bayesian perspective Dug Nychka Natinal Center fr Atmspheric Research Distributins and densities Cnditinal distributins and Bayes Thm Bivariate nrmal Spatial statistics

More information

Smoothing, penalized least squares and splines

Smoothing, penalized least squares and splines Smthing, penalized least squares and splines Duglas Nychka, www.image.ucar.edu/~nychka Lcally weighted averages Penalized least squares smthers Prperties f smthers Splines and Reprducing Kernels The interplatin

More information

Math Foundations 20 Work Plan

Math Foundations 20 Work Plan Math Fundatins 20 Wrk Plan Units / Tpics 20.8 Demnstrate understanding f systems f linear inequalities in tw variables. Time Frame December 1-3 weeks 6-10 Majr Learning Indicatrs Identify situatins relevant

More information

Lecture 10, Principal Component Analysis

Lecture 10, Principal Component Analysis Principal Cmpnent Analysis Lecture 10, Principal Cmpnent Analysis Ha Helen Zhang Fall 2017 Ha Helen Zhang Lecture 10, Principal Cmpnent Analysis 1 / 16 Principal Cmpnent Analysis Lecture 10, Principal

More information

Engineering Decision Methods

Engineering Decision Methods GSOE9210 vicj@cse.unsw.edu.au www.cse.unsw.edu.au/~gs9210 Maximin and minimax regret 1 2 Indifference; equal preference 3 Graphing decisin prblems 4 Dminance The Maximin principle Maximin and minimax Regret

More information

Kinetic Model Completeness

Kinetic Model Completeness 5.68J/10.652J Spring 2003 Lecture Ntes Tuesday April 15, 2003 Kinetic Mdel Cmpleteness We say a chemical kinetic mdel is cmplete fr a particular reactin cnditin when it cntains all the species and reactins

More information

Hypothesis Tests for One Population Mean

Hypothesis Tests for One Population Mean Hypthesis Tests fr One Ppulatin Mean Chapter 9 Ala Abdelbaki Objective Objective: T estimate the value f ne ppulatin mean Inferential statistics using statistics in rder t estimate parameters We will be

More information

MATCHING TECHNIQUES Technical Track Session VI Céline Ferré The World Bank

MATCHING TECHNIQUES Technical Track Session VI Céline Ferré The World Bank MATCHING TECHNIQUES Technical Track Sessin VI Céline Ferré The Wrld Bank When can we use matching? What if the assignment t the treatment is nt dne randmly r based n an eligibility index, but n the basis

More information

Stats Classification Ji Zhu, Michigan Statistics 1. Classification. Ji Zhu 445C West Hall

Stats Classification Ji Zhu, Michigan Statistics 1. Classification. Ji Zhu 445C West Hall Stats 415 - Classificatin Ji Zhu, Michigan Statistics 1 Classificatin Ji Zhu 445C West Hall 734-936-2577 jizhu@umich.edu Stats 415 - Classificatin Ji Zhu, Michigan Statistics 2 Examples f Classificatin

More information

Department of Electrical Engineering, University of Waterloo. Introduction

Department of Electrical Engineering, University of Waterloo. Introduction Sectin 4: Sequential Circuits Majr Tpics Types f sequential circuits Flip-flps Analysis f clcked sequential circuits Mre and Mealy machines Design f clcked sequential circuits State transitin design methd

More information

This section is primarily focused on tools to aid us in finding roots/zeros/ -intercepts of polynomials. Essentially, our focus turns to solving.

This section is primarily focused on tools to aid us in finding roots/zeros/ -intercepts of polynomials. Essentially, our focus turns to solving. Sectin 3.2: Many f yu WILL need t watch the crrespnding vides fr this sectin n MyOpenMath! This sectin is primarily fcused n tls t aid us in finding rts/zers/ -intercepts f plynmials. Essentially, ur fcus

More information

Thermodynamics Partial Outline of Topics

Thermodynamics Partial Outline of Topics Thermdynamics Partial Outline f Tpics I. The secnd law f thermdynamics addresses the issue f spntaneity and invlves a functin called entrpy (S): If a prcess is spntaneus, then Suniverse > 0 (2 nd Law!)

More information

NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION

NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION NUROP Chinese Pinyin T Chinese Character Cnversin NUROP CONGRESS PAPER CHINESE PINYIN TO CHINESE CHARACTER CONVERSION CHIA LI SHI 1 AND LUA KIM TENG 2 Schl f Cmputing, Natinal University f Singapre 3 Science

More information

The standards are taught in the following sequence.

The standards are taught in the following sequence. B L U E V A L L E Y D I S T R I C T C U R R I C U L U M MATHEMATICS Third Grade In grade 3, instructinal time shuld fcus n fur critical areas: (1) develping understanding f multiplicatin and divisin and

More information

Comparing Several Means: ANOVA. Group Means and Grand Mean

Comparing Several Means: ANOVA. Group Means and Grand Mean STAT 511 ANOVA and Regressin 1 Cmparing Several Means: ANOVA Slide 1 Blue Lake snap beans were grwn in 12 pen-tp chambers which are subject t 4 treatments 3 each with O 3 and SO 2 present/absent. The ttal

More information

Perfrmance f Sensitizing Rules n Shewhart Cntrl Charts with Autcrrelated Data Key Wrds: Autregressive, Mving Average, Runs Tests, Shewhart Cntrl Chart

Perfrmance f Sensitizing Rules n Shewhart Cntrl Charts with Autcrrelated Data Key Wrds: Autregressive, Mving Average, Runs Tests, Shewhart Cntrl Chart Perfrmance f Sensitizing Rules n Shewhart Cntrl Charts with Autcrrelated Data Sandy D. Balkin Dennis K. J. Lin y Pennsylvania State University, University Park, PA 16802 Sandy Balkin is a graduate student

More information

Localized Model Selection for Regression

Localized Model Selection for Regression Lcalized Mdel Selectin fr Regressin Yuhng Yang Schl f Statistics University f Minnesta Church Street S.E. Minneaplis, MN 5555 May 7, 007 Abstract Research n mdel/prcedure selectin has fcused n selecting

More information

Determining the Accuracy of Modal Parameter Estimation Methods

Determining the Accuracy of Modal Parameter Estimation Methods Determining the Accuracy f Mdal Parameter Estimatin Methds by Michael Lee Ph.D., P.E. & Mar Richardsn Ph.D. Structural Measurement Systems Milpitas, CA Abstract The mst cmmn type f mdal testing system

More information

Internal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9.

Internal vs. external validity. External validity. This section is based on Stock and Watson s Chapter 9. Sectin 7 Mdel Assessment This sectin is based n Stck and Watsn s Chapter 9. Internal vs. external validity Internal validity refers t whether the analysis is valid fr the ppulatin and sample being studied.

More information

1 The limitations of Hartree Fock approximation

1 The limitations of Hartree Fock approximation Chapter: Pst-Hartree Fck Methds - I The limitatins f Hartree Fck apprximatin The n electrn single determinant Hartree Fck wave functin is the variatinal best amng all pssible n electrn single determinants

More information

Preparation work for A2 Mathematics [2017]

Preparation work for A2 Mathematics [2017] Preparatin wrk fr A2 Mathematics [2017] The wrk studied in Y12 after the return frm study leave is frm the Cre 3 mdule f the A2 Mathematics curse. This wrk will nly be reviewed during Year 13, it will

More information

Eric Klein and Ning Sa

Eric Klein and Ning Sa Week 12. Statistical Appraches t Netwrks: p1 and p* Wasserman and Faust Chapter 15: Statistical Analysis f Single Relatinal Netwrks There are fur tasks in psitinal analysis: 1) Define Equivalence 2) Measure

More information

and the Doppler frequency rate f R , can be related to the coefficients of this polynomial. The relationships are:

and the Doppler frequency rate f R , can be related to the coefficients of this polynomial. The relationships are: Algrithm fr Estimating R and R - (David Sandwell, SIO, August 4, 2006) Azimith cmpressin invlves the alignment f successive eches t be fcused n a pint target Let s be the slw time alng the satellite track

More information

Building to Transformations on Coordinate Axis Grade 5: Geometry Graph points on the coordinate plane to solve real-world and mathematical problems.

Building to Transformations on Coordinate Axis Grade 5: Geometry Graph points on the coordinate plane to solve real-world and mathematical problems. Building t Transfrmatins n Crdinate Axis Grade 5: Gemetry Graph pints n the crdinate plane t slve real-wrld and mathematical prblems. 5.G.1. Use a pair f perpendicular number lines, called axes, t define

More information

Discussion on Regularized Regression for Categorical Data (Tutz and Gertheiss)

Discussion on Regularized Regression for Categorical Data (Tutz and Gertheiss) Discussin n Regularized Regressin fr Categrical Data (Tutz and Gertheiss) Peter Bühlmann, Ruben Dezeure Seminar fr Statistics, Department f Mathematics, ETH Zürich, Switzerland Address fr crrespndence:

More information

A Matrix Representation of Panel Data

A Matrix Representation of Panel Data web Extensin 6 Appendix 6.A A Matrix Representatin f Panel Data Panel data mdels cme in tw brad varieties, distinct intercept DGPs and errr cmpnent DGPs. his appendix presents matrix algebra representatins

More information

Department of Economics, University of California, Davis Ecn 200C Micro Theory Professor Giacomo Bonanno. Insurance Markets

Department of Economics, University of California, Davis Ecn 200C Micro Theory Professor Giacomo Bonanno. Insurance Markets Department f Ecnmics, University f alifrnia, Davis Ecn 200 Micr Thery Prfessr Giacm Bnann Insurance Markets nsider an individual wh has an initial wealth f. ith sme prbability p he faces a lss f x (0

More information

Admin. MDP Search Trees. Optimal Quantities. Reinforcement Learning

Admin. MDP Search Trees. Optimal Quantities. Reinforcement Learning Admin Reinfrcement Learning Cntent adapted frm Berkeley CS188 MDP Search Trees Each MDP state prjects an expectimax-like search tree Optimal Quantities The value (utility) f a state s: V*(s) = expected

More information

Preparation work for A2 Mathematics [2018]

Preparation work for A2 Mathematics [2018] Preparatin wrk fr A Mathematics [018] The wrk studied in Y1 will frm the fundatins n which will build upn in Year 13. It will nly be reviewed during Year 13, it will nt be retaught. This is t allw time

More information