Tree Structured Classifier

Reference: Classification and Regression Trees by L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Chapman & Hall, 1984.

A Medical Example (CART): Predict high risk patients who will not survive at least 30 days on the basis of the initial 24-hour data. 19 variables are measured during the first 24 hours. These include blood pressure, age, etc.

A tree structured classification rule is as follows:

Is the minimum systolic blood pressure over the initial 24-hour period > 91?
  no:  High risk
  yes: Is age > 62.5?
    no:  Low risk
    yes: Is sinus tachycardia present?
      yes: High risk
      no:  Low risk
Denote the feature space by $\mathcal{X}$. The input vector $X \in \mathcal{X}$ contains $p$ features $X_1, X_2, \ldots, X_p$, some of which may be categorical.

Tree structured classifiers are constructed by repeated splits of subsets of $\mathcal{X}$ into two descendant subsets, beginning with $\mathcal{X}$ itself.

Definitions: node, terminal node (leaf node), parent node, child node.

The union of the regions occupied by two child nodes is the region occupied by their parent node. Every leaf node is assigned a class. A query is associated with the class of the leaf node it lands in.

Notation: A node is denoted by $t$. Its left child node is denoted by $t_L$ and its right child by $t_R$. The collection of all nodes is denoted by $T$; the collection of all leaf nodes by $\tilde{T}$. A split is denoted by $s$. The set of splits is denoted by $S$. (A sketch of this notation as a data structure follows.)
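A minimal sketch of the node and tree notation as Python data structures; the field names (`question`, `left`, `right`, `klass`) and the helpers are my own, not from the slides:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Node:
    """A node t. Internal nodes hold a binary question and two children
    (t_L for 'yes', t_R for 'no'); leaf nodes hold an assigned class."""
    question: Optional[Callable] = None   # e.g. lambda x: x[j] <= c
    left: Optional["Node"] = None         # t_L
    right: Optional["Node"] = None        # t_R
    klass: Optional[int] = None           # class label, for leaves

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def leaves(t: Node) -> List[Node]:
    """T-tilde: the collection of all leaf nodes under t."""
    return [t] if t.is_leaf() else leaves(t.left) + leaves(t.right)

def classify(t: Node, x) -> int:
    """A query x is assigned the class of the leaf node it lands in."""
    while not t.is_leaf():
        t = t.left if t.question(x) else t.right
    return t.klass
```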
[Figure: a sequence of splits (split 1, split 2, split 3, ...) recursively partitioning the feature space $\mathcal{X}$ into rectangular regions $\mathcal{X}_1, \mathcal{X}_2, \ldots, \mathcal{X}_8$, together with the corresponding binary tree.]
The Three Elements

The construction of a tree involves the following three elements:

1. The selection of the splits.
2. The decisions when to declare a node terminal or to continue splitting it.
3. The assignment of each terminal node to a class.

In particular, we need to decide the following (a construction sketch follows this list):

1. A set $Q$ of binary questions of the form {Is $X \in A$?}, $A \subseteq \mathcal{X}$.
2. A goodness of split criterion $\Phi(s, t)$ that can be evaluated for any split $s$ of any node $t$.
3. A stop-splitting rule.
4. A rule for assigning every terminal node to a class.
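The three elements slot into one recursive procedure. A minimal sketch, reusing the `Node` class above; `best_split`, `stop_splitting`, and `assign_class` are hypothetical callbacks standing in for the elements defined on the following slides:

```python
def grow_tree(samples, best_split, stop_splitting, assign_class):
    """Recursively split subsets of the data, starting from all of it.
    `samples` is a list of (x, y) pairs. The callbacks:
      best_split(samples)     -> question maximizing Phi(s, t), or None
      stop_splitting(samples) -> True when t should be declared terminal
      assign_class(samples)   -> the class assigned to a terminal node
    """
    question = best_split(samples)
    if question is None or stop_splitting(samples):
        return Node(klass=assign_class(samples))
    yes = [(x, y) for x, y in samples if question(x)]
    no = [(x, y) for x, y in samples if not question(x)]
    if not yes or not no:                 # degenerate split: stop
        return Node(klass=assign_class(samples))
    t = Node(question=question)
    t.left = grow_tree(yes, best_split, stop_splitting, assign_class)
    t.right = grow_tree(no, best_split, stop_splitting, assign_class)
    return t
```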
Standard Set of Questions

The input vector $X = (X_1, X_2, \ldots, X_p)$ contains features of both categorical and ordered types. Each split depends on the value of only a single variable.

For each ordered variable $X_j$, $Q$ includes all questions of the form

{Is $X_j \le c$?}

for all real-valued $c$. Since the training data set is finite, there are only finitely many distinct splits that can be generated by the question {Is $X_j \le c$?}.

If $X_j$ is categorical, taking values, say, in $\{1, 2, \ldots, M\}$, then $Q$ contains all questions of the form {Is $X_j \in A$?}, where $A$ ranges over all subsets of $\{1, 2, \ldots, M\}$.

The splits for all $p$ variables constitute the standard set of questions. A sketch of how these finitely many questions can be enumerated is given below.
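A sketch of enumerating the distinct standard questions from a finite training set: midpoints between consecutive observed values for ordered variables, and nonempty subsets up to complementation for categorical ones (complementary subsets induce the same split, so only $2^{M-1}-1$ are distinct):

```python
from itertools import chain, combinations
import numpy as np

def ordered_questions(x_col):
    """Distinct cutoffs c for {Is X_j <= c?}: one midpoint between each
    pair of consecutive observed values suffices."""
    v = np.unique(x_col)
    return [(v[i] + v[i + 1]) / 2.0 for i in range(len(v) - 1)]

def categorical_questions(levels):
    """Distinct subsets A for {Is X_j in A?}: nonempty proper subsets,
    keeping one of each complementary pair."""
    levels = sorted(levels)
    subsets = chain.from_iterable(
        combinations(levels, r) for r in range(1, len(levels) // 2 + 1))
    out = []
    for A in subsets:
        comp = tuple(l for l in levels if l not in A)
        if len(A) == len(comp) and comp < A:   # drop duplicate half-splits
            continue
        out.append(set(A))
    return out
```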
Goodness of Split

The goodness of split is measured by an impurity function defined for each node. Intuitively, we want each leaf node to be pure, that is, one class dominates.

Definition: An impurity function is a function $\phi$ defined on the set of all $K$-tuples of numbers $(p_1, \ldots, p_K)$ satisfying $p_j \ge 0$, $j = 1, \ldots, K$, $\sum_j p_j = 1$, with the properties:

1. $\phi$ is a maximum only at the point $(\frac{1}{K}, \frac{1}{K}, \ldots, \frac{1}{K})$.
2. $\phi$ achieves its minimum only at the points $(1, 0, \ldots, 0)$, $(0, 1, 0, \ldots, 0)$, ..., $(0, 0, \ldots, 0, 1)$.
3. $\phi$ is a symmetric function of $p_1, \ldots, p_K$, i.e., if you permute the $p_j$, $\phi$ remains constant.
Definition: Given an impurity function $\phi$, define the impurity measure $i(t)$ of a node $t$ as

$i(t) = \phi(p(1|t), p(2|t), \ldots, p(K|t))$,

where $p(j|t)$ is the estimated probability of class $j$ within node $t$.

Goodness of a split $s$ for node $t$, denoted by $\Phi(s, t)$, is defined by

$\Phi(s, t) = \Delta i(s, t) = i(t) - p_R\, i(t_R) - p_L\, i(t_L)$,

where $p_R$ and $p_L$ are the proportions of the samples in node $t$ that go to the right child node $t_R$ and the left child node $t_L$ respectively. (A sketch in code follows.)
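A direct transcription of $\Phi(s, t)$, assuming the within-node class proportions are estimated by relative frequencies (the case $\pi_j = N_j/N$ treated a few slides below); `phi` stands for any impurity function:

```python
from collections import Counter

def proportions(labels, classes):
    """Estimated (p(1|t), ..., p(K|t)) from the labels reaching node t."""
    n = len(labels)
    cnt = Counter(labels)
    return [cnt[j] / n for j in classes]

def goodness_of_split(labels_left, labels_right, classes, phi):
    """Phi(s, t) = i(t) - p_L i(t_L) - p_R i(t_R)."""
    n_l, n_r = len(labels_left), len(labels_right)
    n = n_l + n_r
    p_l, p_r = n_l / n, n_r / n
    i_t = phi(proportions(list(labels_left) + list(labels_right), classes))
    return (i_t - p_l * phi(proportions(labels_left, classes))
                - p_r * phi(proportions(labels_right, classes)))
```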
Define $I(t) = i(t)p(t)$, that is, the impurity of node $t$ weighted by the estimated proportion of data that go to node $t$.

The impurity of tree $T$, $I(T)$, is defined by

$I(T) = \sum_{t \in \tilde{T}} I(t) = \sum_{t \in \tilde{T}} i(t)p(t)$.

Note for any node $t$ the following equations hold:

$p(t_L) + p(t_R) = p(t)$
$p_L = p(t_L)/p(t)$, $\quad p_R = p(t_R)/p(t)$
$p_L + p_R = 1$

Define

$\Delta I(s, t) = I(t) - I(t_L) - I(t_R)$
$\quad = p(t)i(t) - p(t_L)i(t_L) - p(t_R)i(t_R)$
$\quad = p(t)\big(i(t) - p_L i(t_L) - p_R i(t_R)\big)$
$\quad = p(t)\,\Delta i(s, t)$
Possible impurity functions (sketches in code follow this list):

1. Entropy: $-\sum_{j=1}^{K} p_j \log p_j$. If $p_j = 0$, use the limit $\lim_{p_j \to 0} p_j \log p_j = 0$.
2. Misclassification rate: $1 - \max_j p_j$.
3. Gini index: $\sum_{j=1}^{K} p_j(1 - p_j) = 1 - \sum_{j=1}^{K} p_j^2$.

The Gini index seems to work best in practice for many problems.

The twoing rule: At a node $t$, choose the split $s$ that maximizes

$\frac{p_L p_R}{4} \Big[ \sum_j |p(j|t_L) - p(j|t_R)| \Big]^2$.
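The three impurity functions and the twoing criterion, transcribed directly into Python:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # lim_{p -> 0} p log p = 0
    return -np.sum(p * np.log(p))

def misclassification(p):
    return 1.0 - np.max(p)

def gini(p):
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

def twoing(p_left, p_right, p_L, p_R):
    """(p_L p_R / 4) * (sum_j |p(j|t_L) - p(j|t_R)|)^2."""
    d = np.abs(np.asarray(p_left) - np.asarray(p_right)).sum()
    return p_L * p_R / 4.0 * d ** 2

# e.g. gini([0.5, 0.5]) == 0.5 (maximal for K = 2); gini([1.0, 0.0]) == 0.0
```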
Estimate the posterior probabilities of classes in each node:

The total number of samples is $N$ and the number of samples in class $j$, $1 \le j \le K$, is $N_j$. The number of samples going to node $t$ is $N(t)$; the number of samples with class $j$ going to node $t$ is $N_j(t)$.

$\sum_{j=1}^{K} N_j(t) = N(t)$.
$N_j(t_L) + N_j(t_R) = N_j(t)$.
For a full (balanced) tree, the sum of $N(t)$ over all the $t$'s at the same level is $N$.

Denote the prior probability of class $j$ by $\pi_j$. The priors $\pi_j$ can be estimated from the data by $N_j/N$. Sometimes priors are given beforehand.

The estimated probability of a sample in class $j$ going to node $t$ is $p(t|j) = N_j(t)/N_j$.

$p(t_L|j) + p(t_R|j) = p(t|j)$.
For a full tree, the sum of $p(t|j)$ over all $t$'s at the same level is 1.
The joint probability of a sample being in class $j$ and going to node $t$ is thus:

$p(j, t) = \pi_j\, p(t|j) = \pi_j N_j(t)/N_j$.

The probability of any sample going to node $t$ is:

$p(t) = \sum_{j=1}^{K} p(j, t) = \sum_{j=1}^{K} \pi_j N_j(t)/N_j$.

Note $p(t_L) + p(t_R) = p(t)$.

The probability of a sample being in class $j$ given that it goes to node $t$ is:

$p(j|t) = p(j, t)/p(t)$. For any $t$, $\sum_{j=1}^{K} p(j|t) = 1$.

When $\pi_j = N_j/N$, we have the following simplification (see the sketch below):

$p(j|t) = N_j(t)/N(t)$
$p(t) = N(t)/N$
$p(j, t) = N_j(t)/N$
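A short sketch of these estimates for a single node, from the class counts alone; the function name and argument layout are my own:

```python
import numpy as np

def node_probabilities(N_j, N_j_t, priors=None):
    """Estimates p(t) and p(j|t) for one node t.

    N_j    : samples per class in the whole training set
    N_j_t  : samples per class reaching node t
    priors : pi_j; if None, estimated as N_j / N
    """
    N_j = np.asarray(N_j, dtype=float)
    N_j_t = np.asarray(N_j_t, dtype=float)
    pi = N_j / N_j.sum() if priors is None else np.asarray(priors, float)
    p_t_given_j = N_j_t / N_j          # p(t|j) = N_j(t) / N_j
    p_j_and_t = pi * p_t_given_j       # p(j, t) = pi_j p(t|j)
    p_t = p_j_and_t.sum()              # p(t) = sum_j p(j, t)
    return p_t, p_j_and_t / p_t        # p(t), p(j|t)

# with pi_j = N_j / N this reduces to p(t) = N(t)/N, p(j|t) = N_j(t)/N(t):
# node_probabilities([60, 40], [30, 10]) -> (0.4, [0.75, 0.25])
```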
Stopping Criteria

A simple criterion: stop splitting a node $t$ when

$\max_{s \in S} \Delta I(s, t) < \beta$,

where $\beta$ is a chosen threshold.

The above stopping criterion is unsatisfactory: a node with a small decrease of impurity after one step of splitting may have a large decrease after multiple levels of splits.
Class Assignment Rule

A class assignment rule assigns a class $j \in \{1, \ldots, K\}$ to every terminal node $t \in \tilde{T}$. The class assigned to node $t \in \tilde{T}$ is denoted by $\kappa(t)$.

For 0-1 loss, the class assignment rule is:

$\kappa(t) = \arg\max_j p(j|t)$.

The resubstitution estimate $r(t)$ of the probability of misclassification, given that a case falls into node $t$, is

$r(t) = 1 - \max_j p(j|t) = 1 - p(\kappa(t)|t)$.

Denote $R(t) = r(t)p(t)$.

The resubstitution estimate for the overall misclassification rate $R(T)$ of the tree classifier $T$ is:

$R(T) = \sum_{t \in \tilde{T}} R(t)$. (A sketch in code follows.)
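The assignment rule and the resubstitution estimates, transcribed directly; `leaf_stats` is a hypothetical list of per-leaf quantities:

```python
import numpy as np

def assign_class(p_j_given_t):
    """kappa(t) = arg max_j p(j|t), for 0-1 loss (classes indexed from 0)."""
    return int(np.argmax(p_j_given_t))

def node_misclassification(p_j_given_t, p_t):
    """Returns r(t) = 1 - max_j p(j|t) and R(t) = r(t) p(t)."""
    r_t = 1.0 - np.max(p_j_given_t)
    return r_t, r_t * p_t

def tree_resubstitution_rate(leaf_stats):
    """R(T) = sum over leaves t of R(t);
    leaf_stats = [(p_j_given_t, p_t), ...] over all leaves."""
    return sum(node_misclassification(p, pt)[1] for p, pt in leaf_stats)
```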
Proposition: For any split of a node $t$ into $t_L$ and $t_R$,

$R(t) \ge R(t_L) + R(t_R)$.

Proof: Denote $j^* = \kappa(t)$.

$p(j^*|t) = p(j^*, t_L|t) + p(j^*, t_R|t)$
$\quad = p(j^*|t_L)p(t_L|t) + p(j^*|t_R)p(t_R|t)$
$\quad = p_L\, p(j^*|t_L) + p_R\, p(j^*|t_R)$
$\quad \le p_L \max_j p(j|t_L) + p_R \max_j p(j|t_R)$

Hence,

$r(t) = 1 - p(j^*|t)$
$\quad \ge 1 - \big[ p_L \max_j p(j|t_L) + p_R \max_j p(j|t_R) \big]$
$\quad = p_L (1 - \max_j p(j|t_L)) + p_R (1 - \max_j p(j|t_R))$
$\quad = p_L\, r(t_L) + p_R\, r(t_R)$

Finally,

$R(t) = p(t)r(t)$
$\quad \ge p(t)p_L\, r(t_L) + p(t)p_R\, r(t_R)$
$\quad = p(t_L)r(t_L) + p(t_R)r(t_R)$
$\quad = R(t_L) + R(t_R)$
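A quick numeric illustration of the proposition on made-up class counts (splitting never increases the resubstitution error, here strictly decreasing it):

```python
# made-up node t: 60 samples of class 1, 40 of class 2, with p(t) = 1
p_t = 1.0
r_t = 1.0 - 0.6                        # r(t) = 1 - max_j p(j|t) = 0.4
R_t = r_t * p_t

# a split sending 50 samples left (40 vs 10) and 50 right (20 vs 30)
p_L = p_R = 0.5
r_tL = 1.0 - 40 / 50                   # 0.2
r_tR = 1.0 - 30 / 50                   # 0.4
print(R_t, p_L * r_tL + p_R * r_tR)    # 0.4 >= 0.3, as the proposition says
```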
Digit Recognition Example (CART)

The digits are shown by different on-off combinations of seven horizontal and vertical lights. Each digit is represented by a 7-dimensional vector of zeros and ones. The $i$th sample is $x_i = (x_{i1}, x_{i2}, \ldots, x_{i7})$. If $x_{ij} = 1$, the $j$th light is on; if $x_{ij} = 0$, the $j$th light is off.

[Table: the on/off (1/0) states of the seven lights $x_1, \ldots, x_7$ for each of the digits 1, 2, ..., 9, 0.]
The data for the example are generated by a malfunctioning calculator. Each of the seven lights has probability 0.1 of being in the wrong state independently. The training set contains 200 samples generated according to the specified distribution. (A simulation sketch follows.)

A tree structured classifier is applied. The set of questions $Q$ contains: Is $x_j = 0$?, $j = 1, 2, \ldots, 7$. The twoing rule is used in splitting. The pruning and cross-validation method is used to choose the right sized tree.

Classification performance:
The error rate estimated by using a test set of size 5000 is 0.30.
The error rate estimated by cross-validation using the training set is 0.30.
The resubstitution estimate of the error rate is 0.29.
The Bayes error rate is 0.26.
There is little room for improvement over the tree classifier.
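A minimal simulation sketch of the malfunctioning calculator. The slide's own light numbering and digit table are not recoverable here, so the `SEGMENTS` mapping below is my assumption, using a conventional seven-segment ordering (top, top-left, top-right, middle, bottom-left, bottom-right, bottom):

```python
import numpy as np

# Assumed seven-segment encodings, one row per digit 1, ..., 9, 0.
SEGMENTS = {
    1: (0, 0, 1, 0, 0, 1, 0), 2: (1, 0, 1, 1, 1, 0, 1),
    3: (1, 0, 1, 1, 0, 1, 1), 4: (0, 1, 1, 1, 0, 1, 0),
    5: (1, 1, 0, 1, 0, 1, 1), 6: (1, 1, 0, 1, 1, 1, 1),
    7: (1, 0, 1, 0, 0, 1, 0), 8: (1, 1, 1, 1, 1, 1, 1),
    9: (1, 1, 1, 1, 0, 1, 1), 0: (1, 1, 1, 0, 1, 1, 1),
}

def malfunctioning_calculator(n, flip_prob=0.1, seed=None):
    """Draw digits uniformly; each light is independently in the
    wrong state with probability flip_prob."""
    rng = np.random.default_rng(seed)
    digits = rng.choice(list(SEGMENTS), size=n)
    X = np.array([SEGMENTS[d] for d in digits])
    flips = rng.random(X.shape) < flip_prob
    return np.where(flips, 1 - X, X), digits

X_train, y_train = malfunctioning_calculator(200, seed=0)
```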
[Figure: the selected classification tree. Each internal node asks a question of the form "Is $x_j = 0$?"; the root splits on $x_5$, with further splits on $x_2$, $x_3$, and other variables below it, and the ten leaf nodes are labeled with the digits 1, 2, ..., 9, 0.]

Accidentally, every digit occupies exactly one leaf node. In general, one class may occupy any number of leaf nodes and occasionally no leaf node.

Two of the seven variables, among them $X_7$, are never used.
Waveform Example (CART)

Three functions $h_1(\tau)$, $h_2(\tau)$, $h_3(\tau)$ are shifted versions of each other, as shown in the figure.

[Figure: the three triangular waveforms $h_1$, $h_3$, $h_2$ plotted over $\tau = 1, 3, 5, \ldots, 21$.]

Each $h_j$ is specified by the same triangular (tent-shaped) function, shifted in $\tau$. Its values at the integers $\tau = 1, \ldots, 21$ are measured.
The three classes of waveforms are random convex combinations of two of these waveforms plus independent Gaussian noise. Each sample is a 21-dimensional vector containing the values of the random waveform measured at $\tau = 1, 2, \ldots, 21$.

To generate a sample in class 1, a random number $u$ uniformly distributed in $[0, 1]$ and 21 random numbers $\epsilon_1, \epsilon_2, \ldots, \epsilon_{21}$, normally distributed with mean zero and variance 1, are generated. Then

$x_j = u\, h_1(j) + (1 - u)\, h_2(j) + \epsilon_j, \quad j = 1, \ldots, 21$.

To generate a sample in class 2, repeat the above process to generate a random number $u$ and 21 random numbers $\epsilon_1, \ldots, \epsilon_{21}$ and set

$x_j = u\, h_1(j) + (1 - u)\, h_3(j) + \epsilon_j, \quad j = 1, \ldots, 21$.

Class 3 vectors are generated by

$x_j = u\, h_2(j) + (1 - u)\, h_3(j) + \epsilon_j, \quad j = 1, \ldots, 21$.

Example random waveforms are shown below, after a generation sketch in code.
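A generation sketch. The slide does not give the triangle constants; I assume the standard parameterization of the waveform data (triangles of height 6 peaking at $\tau = 11, 15, 7$ for $h_1, h_2, h_3$):

```python
import numpy as np

def waveform_samples(n, seed=None):
    """Generate n samples of the three waveform classes (equal priors)."""
    rng = np.random.default_rng(seed)
    tau = np.arange(1, 22)                     # tau = 1, ..., 21
    h1 = np.maximum(6 - np.abs(tau - 11), 0)   # tent centered at 11
    h2 = np.maximum(6 - np.abs(tau - 15), 0)   # shifted right
    h3 = np.maximum(6 - np.abs(tau - 7), 0)    # shifted left
    pairs = {1: (h1, h2), 2: (h1, h3), 3: (h2, h3)}
    y = rng.integers(1, 4, size=n)             # class labels 1, 2, 3
    u = rng.random(n)                          # convex mixing weights
    eps = rng.standard_normal((n, 21))         # N(0, 1) noise
    X = np.array([u[i] * pairs[c][0] + (1 - u[i]) * pairs[c][1]
                  for i, c in enumerate(y)]) + eps
    return X, y

X_train, y_train = waveform_samples(300, seed=0)
```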
[Figure: five example random waveforms from each of Class 1, Class 2, and Class 3, plotted over $\tau = 1, \ldots, 21$.]
300 random samples are generated using prior probabilities $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$ for training.

Construction of the tree:
The set of questions: {Is $x_j \le c$?} for $c$ ranging over all real numbers and $j = 1, \ldots, 21$.
The Gini index is used for measuring goodness of split.
The final tree is selected by pruning and cross-validation.

Results (an end-to-end sketch in code follows):
The cross-validation estimate of the misclassification rate is 0.29.
The misclassification rate on a separate test set of size 5000 is 0.28.
The Bayes classification rule can be derived. Applying this rule to the test set yields a misclassification rate of 0.14.
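A rough end-to-end sketch, reusing `waveform_samples` from above, with scikit-learn's `DecisionTreeClassifier` (a CART-style implementation) and cost-complexity pruning chosen by 10-fold cross-validation. This is not the slides' exact procedure, and exact error rates will vary run to run:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X_train, y_train = waveform_samples(300, seed=0)
X_test, y_test = waveform_samples(5000, seed=1)

# grow a large tree, then pick the cost-complexity pruning level by CV
path = DecisionTreeClassifier(criterion="gini").cost_complexity_pruning_path(
    X_train, y_train)
scores = [cross_val_score(
              DecisionTreeClassifier(criterion="gini", ccp_alpha=a),
              X_train, y_train, cv=10).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]

tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=best_alpha)
tree.fit(X_train, y_train)
print("test misclassification rate:",
      1 - tree.score(X_test, y_test))   # the slides report about 0.28
```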
[Figure: the pruned waveform classification tree. Each internal node splits on a question of the form $x_j \le c$; each node is annotated with the counts of training samples from the three classes reaching it, and each leaf is labeled with its assigned class 1, 2, or 3.]
Advantages of the Tree-Structured Approach

Handles both categorical and ordered variables in a simple and natural way.
Automatic stepwise variable selection and complexity reduction.
It provides an estimate of the misclassification rate for a query sample.
It is invariant under all monotone transformations of individual ordered variables.
Robust to outliers and misclassified points in the training set.
Easy to interpret.
Variable Combinations

Splits perpendicular to the coordinate axes are inefficient in certain cases, for instance when the classes are separated by a slanted boundary.

Use linear combinations of variables: Is $\sum_j a_j x_j \le c$?

The amount of computation is increased significantly, since the search is over the coefficients $a_j$ as well as the threshold $c$. Price to pay: model complexity increases. (A search sketch follows.)
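A sketch of searching for a linear-combination (oblique) split. CART optimizes the coefficients $a_j$ coordinate-wise; the random-direction search below is a deliberate simplification that still shows why the computation grows:

```python
import numpy as np

def gini_from_labels(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_oblique_split(X, y, n_directions=500, seed=None):
    """Search 'Is sum_j a_j x_j <= c?' over random unit directions a,
    scoring each candidate by the decrease in Gini impurity."""
    rng = np.random.default_rng(seed)
    best = (-np.inf, None, None)               # (gain, a, c)
    for _ in range(n_directions):
        a = rng.standard_normal(X.shape[1])
        a /= np.linalg.norm(a)
        z = X @ a                              # projected data
        for c in np.quantile(z, np.linspace(0.1, 0.9, 9)):
            left, right = y[z <= c], y[z > c]
            if len(left) == 0 or len(right) == 0:
                continue
            p_L = len(left) / len(y)
            gain = (gini_from_labels(y)
                    - p_L * gini_from_labels(left)
                    - (1 - p_L) * gini_from_labels(right))
            if gain > best[0]:
                best = (gain, a, c)
    return best
```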
Missing Values

Certain variables may be missing in some training samples. This often occurs in gene-expression microarray data. Suppose each variable has a 5% chance of being missing independently. Then for a training sample with 50 variables, the probability of missing some variables is as high as $1 - 0.95^{50} \approx 92.3\%$. A query sample to be classified may also have missing variables.

Solution: find surrogate splits. Suppose the best split for node $t$ is $s$, which involves a question on $X_m$. Find another split $s'$ on a variable $X_j$, $j \ne m$, which is most similar to $s$ in a certain sense. Similarly, the second best surrogate split, the third, and so on, can be found. (A sketch follows.)
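A sketch of ranking surrogate splits. As the similarity "in a certain sense" I use the fraction of node samples that the candidate split sends to the same side as the best split (allowing the two sides to be swapped); this is one common choice, not necessarily the slides':

```python
import numpy as np

def split_agreement(side_s, side_s2):
    """Fraction of samples two splits send to the same side
    (boolean arrays: True = left); sides may be swapped."""
    same = np.mean(side_s == side_s2)
    return max(same, 1 - same)

def surrogate_splits(X, best_side, m, thresholds):
    """Rank splits 'Is X_j <= c?' (j != m) by agreement with the best
    split on X_m, whose left/right assignment is `best_side`.
    `thresholds[j]` lists candidate cutoffs for variable j."""
    ranked = []
    for j in range(X.shape[1]):
        if j == m:
            continue
        for c in thresholds[j]:
            ranked.append((split_agreement(best_side, X[:, j] <= c), j, c))
    return sorted(ranked, reverse=True)
```

The first entry of the returned list is the best surrogate: when $X_m$ is missing for a sample, the query on that surrogate variable is asked instead, and so on down the ranking.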