Data Mining CS57300 Purdue University Bruno Ribeiro February 8, 2018
Decision trees
Why Trees?
- Interpretable/intuitive; popular in medical applications because they mimic the way a doctor thinks
- Model discrete outcomes nicely
- Can be very powerful: as complex as you need them
- C4.5 and CART appear among top entries on Kaggle; decision trees are very effective and popular
Sure, But Why Trees?
- Easy-to-understand knowledge representation
- Can handle mixed variables
- Recursive, divide-and-conquer learning method
- Efficient inference
Example: deciding whether to play outside
Divide-and-conquer Classification
Consider input tuples (x_i, y_i) for the i-th observation.
[Figure: a decision tree over two features. The root tests x_1 > θ_1; internal nodes test x_2 against θ_2 and θ_3 and x_1 against θ_4, recursively partitioning the (x_1, x_2) plane into rectangular regions A, B, C, D, E, one per leaf.]
Tree learning
Finding the best tree is intractable: we would have to consider all 2^m combinations, where m is the number of features. Instead, we often greedily grow the tree by splitting on attributes one by one; to determine which attribute to split on, we look at node impurity.
Top-down recursive divide-and-conquer algorithm:
- Start with all examples at the root
- Select the best attribute/feature
- Recurse and repeat
Other issues: how to construct features, when to stop growing, and pruning irrelevant parts of the tree.
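The greedy procedure above can be sketched as a minimal ID3-style learner. This is an illustrative sketch, not the exact algorithm from the slides: the helper names are hypothetical, features are assumed categorical (rows given as dicts), and information gain with entropy is used as the split score.

```python
import math
from collections import Counter

def entropy(labels):
    """H(Y) in bits for a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def build_tree(rows, labels, features):
    """Greedy top-down induction: pick the feature whose split
    maximizes information gain, then recurse on each subset."""
    # Stop when the node is pure or no features remain: return a majority-class leaf.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]

    def gain(f):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[f], []).append(y)
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder

    best = max(features, key=gain)
    node = {"feature": best, "children": {}}
    for value in set(row[best] for row in rows):
        sub = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = zip(*sub)
        node["children"][value] = build_tree(
            list(sub_rows), list(sub_labels),
            [f for f in features if f != best])
    return node
```

On the fraud toy data from the next slide, this learner also picks Series7 at the root, since the Series7 = Y branch is pure.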
Worked example:

Fraud  Age  Degree  StartYr  Series7
+      22   Y       2005     N
-      25   N       2003     Y
-      31   Y       1995     Y
-      27   Y       1999     Y
+      24   N       2006     N
-      29   N       2003     N

Score each attribute split for these instances (Age, Degree, StartYr, Series7); choose the split on Series7.

Series7 = Y branch (all negative):
Fraud  Age  Degree  StartYr  Series7
-      25   N       2003     Y
-      31   Y       1995     Y
-      27   Y       1999     Y

Series7 = N branch:
Fraud  Age  Degree  StartYr  Series7
+      22   Y       2005     N
+      24   N       2006     N
-      29   N       2003     N

Score each remaining attribute split for these instances (Age, Degree, StartYr); choose the split on Age > 28, which separates the negative example (Age 29) from the two positives (Ages 22 and 24).
Overview (with two features and a 1D target)
Features: X_1, X_2. Target: Y.
[Figure 9.2, "Partitions and CART": the top-right panel shows a partition of a two-dimensional feature space by recursive binary splitting (thresholds t_1, ..., t_4 giving regions R_1, ..., R_5), as used in CART, applied to some fake data. The top-left panel shows a general partition that cannot be obtained from recursive binary splitting. The bottom-left panel shows the tree corresponding to the top-right partition, and a perspective plot of the prediction surface appears in the bottom-right panel.]
Tree models
Most well-known systems:
- CART: Breiman, Friedman, Olshen and Stone
- ID3, C4.5: Quinlan
How do they differ?
- Split scoring function
- Stopping criterion
- Pruning mechanism
- Predictions in leaf nodes
Scoring functions: Local split value
Choosing an attribute/feature
Idea: a good feature splits the examples into subsets that distinguish among the class labels as much as possible, ideally into pure sets of "all positive" or "all negative".
[Figure: restaurant example comparing a split on Patrons? (None / Some / Full) with a split on Type? (French / Italian / Thai / Burger).]
Bias-variance tradeoff: choosing the most discriminating attribute first may not give the best tree (bias), but it can make the tree small (low variance).
Association between attribute and class label
Data: 14 examples with attribute Income (High / Med / Low) and class label (buy / no buy).
Contingency table:
Income  Buy  No buy
High    2    2
Med     4    2
Low     3    1
Mathematically Defining a Good Split
We start with information theory: how uncertain will the answer be if we split the tree this way?
Say we need to decide between k options. Uncertainty in the answer Y ∈ {1, ..., k} when the probabilities are (p_1, ..., p_k) can be quantified via entropy:
H(p_1, ..., p_k) = −Σ_i p_i log_2 p_i
Convenient notation: B(p) = H(p, 1−p), the number of bits necessary to encode a Boolean outcome with probability p.
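The entropy and B(·) definitions above can be written out directly. A small sketch: H takes a probability vector, skipping zero entries so that 0 log 0 is treated as 0 by convention.

```python
import math

def H(ps):
    """Entropy in bits of a discrete distribution (p_1, ..., p_k)."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def B(p):
    """B(p) = H(p, 1 - p): entropy of a Boolean outcome with probability p."""
    return H([p, 1 - p])
```

For example, B(0.5) = 1 bit (a fair coin), and B(0) = B(1) = 0 (no uncertainty).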
Amount of Information in the Tree
Suppose we have p positive and n negative examples at the root: B(p/(p + n)) bits are needed to classify a new example.
Information is always conserved: if encoding the information in the leaves is lossless, then the tree has a lossless encoding. The entropy of the leaves (in bits) plus the information carried in the tree structure (in bits) equals the total information in the data.
Let split branch i have p_i positive and n_i negative examples; then B(p_i/(p_i + n_i)) bits are needed to classify a new example in that branch. The expected number of bits per example over all branches is
Σ_i ((p_i + n_i)/(p + n)) · B(p_i/(p_i + n_i))
Choose the next attribute to split so as to minimize this remaining information, which maximizes the information in the tree (as information is conserved).
Information gain
Information Gain is the amount of information that the tree structure encodes.
H[X] is the entropy: the expected number of bits to encode a randomly selected element of X. 𝒜 is the set of subsets of the data induced by a given split; S is the entire data.
Gain(S, 𝒜) = H[S] − Σ_{A ∈ 𝒜} (|A|/|S|) H[A]
Example: H[buys_computer] = −9/14 log 9/14 − 5/14 log 5/14 = 0.9400
Example: gain of splitting on Income
Gain(S, 𝒜) = H[S] − Σ_{A ∈ 𝒜} (|A|/|S|) H[A]
Entropy(Income=high) = −2/4 log 2/4 − 2/4 log 2/4 = 1
Entropy(Income=med) = −4/6 log 4/6 − 2/6 log 2/6 = 0.9183
Entropy(Income=low) = −3/4 log 3/4 − 1/4 log 1/4 = 0.8113
Gain(D, Income) = 0.9400 − (4/14 · [1] + 6/14 · [0.9183] + 4/14 · [0.8113]) = 0.029
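The worked example can be reproduced numerically. A sketch assuming the Income contingency table from the earlier slide (2/2, 4/2, 3/1 buy/no-buy counts):

```python
import math

def entropy(counts):
    """Entropy in bits from a list of class counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

# (buy, no-buy) counts per Income level, from the contingency table.
table = {"high": (2, 2), "med": (4, 2), "low": (3, 1)}

total_buy = sum(b for b, nb in table.values())   # 9
total_no = sum(nb for b, nb in table.values())   # 5
N = total_buy + total_no                         # 14

H_S = entropy([total_buy, total_no])             # entropy at the root
remainder = sum((b + nb) / N * entropy([b, nb]) for b, nb in table.values())
gain = H_S - remainder                           # Gain(D, Income)
```

Running this gives H_S ≈ 0.940 and gain ≈ 0.029, matching the slide's calculation.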
Gini gain
Similar to information gain, but uses the Gini index instead of entropy. Measures the decrease in Gini index after the split:
Gain(S, 𝒜) = Gini(S) − Σ_{A ∈ 𝒜} (|A|/|S|) Gini(A)
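A sketch of the Gini index and Gini gain on class-count vectors, using the same split notation as above. The numbers in the usage note apply it to the Income table from the information-gain example.

```python
def gini(counts):
    """Gini index 1 - sum_c p_c^2 from a list of class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_gain(parent_counts, child_counts_list):
    """Decrease in Gini index from splitting the parent node:
    Gini(S) minus the size-weighted Gini of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(cc) / n * gini(cc) for cc in child_counts_list)
    return gini(parent_counts) - weighted
```

For the Income table: gini([9, 5]) ≈ 0.459 and gini_gain([9, 5], [[2, 2], [4, 2], [3, 1]]) ≈ 0.019.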
Comparing information gain to Gini gain
[Figure: left panel plots node impurity (0.0 to 0.5) as a function of the class proportion p for entropy, Gini index, and misclassification error; entropy is the largest of the three, Gini in between. Right panels compare information gain and Gini gain as a function of p, the fraction of target A sent into the branch that outputs B.]
How does the score function affect feature selection?
[Figure: the same candidate split on x_2 scored with entropy and with Gini.] The Gini score can produce a larger gain.
Chi-Square score
Widely used to test independence between two categorical attributes (e.g., feature and class label).
Hypothesis H_0: the attributes are independent.
Consider a contingency table with k entries (k = rows × columns). The score sums, over the counts in the table, the normalized squared deviation of the observed counts from the counts expected under H_0:
X² = Σ_{i=1}^{k} (o_i − e_i)² / e_i
If the counts are large (large number of examples), the sampling distribution can be approximated by a chi-square distribution.
Contingency table with margins:
Income  Buy  No buy  Total
High    2    2       4
Med     4    2       6
Low     3    1       4
Total   9    5       14
Calculating expected values for a cell
X² = Σ_{i=1}^{k} (o_i − e_i)² / e_i
Consider a 2×2 table with attribute A ∈ {0, 1} as rows, class C ∈ {+, −} as columns, and cell counts a, b in row A=0 and c, d in row A=1. Then:
o_(0,+) = a
e_(0,+) = p(A=0, C=+) · N = p(A=0) p(C=+ | A=0) · N = p(A=0) p(C=+) · N (assuming independence)
        = N · ((a + b)/N) · ((a + c)/N)
Example calculation
Observed:              Expected:
      Buy  No buy            Buy   No buy
High  2    2           High  2.57  1.43
Med   4    2           Med   3.86  2.14
Low   3    1           Low   2.57  1.43

χ² = Σ_{i=1}^{k} (o_i − e_i)² / e_i
   = (2 − 2.57)²/2.57 + (2 − 1.43)²/1.43 + (4 − 3.86)²/3.86 + (2 − 2.14)²/2.14 + (3 − 2.57)²/2.57 + (1 − 1.43)²/1.43
   = 0.57
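The χ² computation generalizes to any contingency table; a minimal sketch that reproduces the 0.57 result for the Income-vs-Buy table:

```python
def chi_square(table):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    N = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / N  # expected count under H0
            stat += (o - e) ** 2 / e
    return stat

# Rows: High, Med, Low; columns: (Buy, No buy).
income_vs_buy = [[2, 2], [4, 2], [3, 1]]
```

chi_square(income_vs_buy) ≈ 0.57, matching the worked example above.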
Tree learning
Top-down recursive divide-and-conquer algorithm:
- Start with all examples at the root
- Select the best attribute/feature
- Partition examples by the selected attribute
- Recurse and repeat
Other issues: how to construct features, when to stop growing, and pruning irrelevant parts of the tree.
Controlling Variance One major problem with trees is their high variance. Often a small change in the data can result in a very different series of splits, making interpretation somewhat precarious. The major reason for this instability is the hierarchical nature of the process: the effect of an error in the top split is propagated down to all of the splits below it.
Overfitting
Consider a distribution D of data representing a population, and a sample DS drawn from D, which is used as training data.
Given a model space M, a score function S, and a learning algorithm that returns a model m ∈ M, the algorithm overfits the training data DS if:
∃ m' ∈ M such that S(m, DS) > S(m', DS) but S(m, D) < S(m', D)
In other words, there is another model (m') that is better on the entire distribution; had we learned from the full distribution, we would have selected it instead.
Example learning problem
Task: devise a rule to classify items based on the attribute X.
Knowledge representation: if-then rules. Example rule: if x > 25 then + else −.
What is the model space? All possible thresholds on X.
What score function? Prediction error rate.
Approaches to avoid overfitting
- Regularization (priors)
- Hold out an evaluation set, used to adjust the structure of the learned model (e.g., pruning in decision trees)
- Statistical tests during learning to only include structure with significant associations (e.g., pre-pruning in decision trees)
- Penalty term in the classifier scoring function (i.e., change the score function to prefer simpler models)
How to avoid overfitting in decision trees
Post-pruning: use a separate set of examples to evaluate the utility of pruning nodes from the tree (after the tree is fully grown).
Pre-pruning: apply a statistical test to decide whether to expand a node, or use an explicit measure of complexity to penalize large trees (e.g., Minimum Description Length).
Algorithm comparison
CART
- Evaluation criterion: Gini index
- Search algorithm: simple-to-complex, hill-climbing search
- Stopping criterion: when leaves are pure
- Pruning mechanism: cross-validation to select Gini threshold
C4.5
- Evaluation criterion: information gain
- Search algorithm: simple-to-complex, hill-climbing search
- Stopping criterion: when leaves are pure
- Pruning mechanism: reduced error pruning
CART: Finding a Good Gini Threshold
Background: k-fold cross-validation
- Randomly partition the training data into k folds
- For i = 1 to k: learn a model on D minus the i-th fold; evaluate the model on the i-th fold
- Average results from all k trials
[Figure: a dataset with columns Y, X1, X2 split into six train/test pairs, Train1/Test1 through Train6/Test6, one per fold.]
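The partitioning step can be sketched as follows. This works on row indices so it applies to any dataset representation; the seed parameter is an added convenience for reproducibility, not part of the slide's description.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition the indices 0..n-1 into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_test_splits(n, k, seed=0):
    """Yield (train_indices, test_indices) pairs, one per fold:
    the i-th fold is the test set, the rest is the training set."""
    folds = k_fold_indices(n, k, seed)
    for i, test in enumerate(folds):
        train = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
        yield train, test
```

Each of the k splits uses every example exactly once, either for training or for testing, and the test folds are disjoint across trials.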
Choosing a Gini threshold with cross-validation
For i in 1..k:
  For t in a threshold set (e.g., [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]):
    Learn a decision tree on Train_i with Gini gain threshold t (i.e., stop growing when the max Gini gain is less than t)
    Evaluate the learned tree on Test_i (e.g., with accuracy)
  Set t_max,i to the t with the best performance on Test_i
Set t_max to the average of t_max,i over the k trials
Relearn the tree on all the data using t_max as the Gini gain threshold
C4.5: reduced error pruning
Use a pruning set to estimate accuracy of sub-trees and of individual nodes.
Let T be a sub-tree rooted at node v. Define the gain of pruning at v as the error reduction on the pruning set from replacing T with a leaf at v.
Repeat: prune at the node with the largest gain, until only negative-gain nodes remain.
Bottom-up restriction: T can only be pruned if it does not contain a sub-tree with lower error than T itself.
Source: www.ailab.si/blaz/predavanja/uisp/slides/uisp05-postpruning.ppt
Pre-pruning methods
Stop growing the tree at some point during top-down construction, when there is no longer sufficient data to make reliable decisions.
Approach: choose a threshold on the feature score; stop splitting if the best feature score is below the threshold.
Determine the chi-square threshold analytically
Stop growing when the chi-square feature score is not statistically significant. Chi-square has a known sampling distribution, so we can look up the significance threshold.
Degrees of freedom = (#rows − 1)(#cols − 1)
For a 2×2 table (1 degree of freedom), 3.84 is the 95% critical value.
X² = Σ_{i=1}^{k} (o_i − e_i)² / e_i
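The resulting pre-pruning rule can be sketched as a small lookup. The 95% critical values for df = 1, 2, 3 below are the standard tabulated ones; note the 3×2 Income-vs-Buy table earlier has df = 2.

```python
def should_split(chi_sq, n_rows, n_cols):
    """Pre-pruning test: split only if the chi-square score exceeds
    the 95% critical value for the table's degrees of freedom."""
    critical_95 = {1: 3.84, 2: 5.99, 3: 7.81}  # standard chi-square table values
    df = (n_rows - 1) * (n_cols - 1)
    return chi_sq > critical_95[df]
```

For the Income example, χ² ≈ 0.57 with df = 2 is far below 5.99, so this test would refuse the split: the observed association is not statistically significant.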