Symbolic methods in TC: Decision Trees
ML for NLP
Lecturer: Kevin Koidl; Assistant Lecturer: Alfredo Maldonado
https://www.cs.tcd.ie/kevin.koidl/cs4062/
kevin.koidl@scss.tcd.ie, maldonaa@tcd.ie
2016-2017
A symbolic method: decision trees
Symbolic methods offer the advantage that their classification decisions are easily interpretable by humans.
Decision trees:
- data represented as vectors of discrete-valued (or discretised) attributes
- classification through binary tests on highly discriminative features
- test sequence encoded as a tree structure
[Figure: decision tree for the tennis weather task. The root tests outlook (sunny / overcast / rainy); sunny leads to a humidity test (high -> no, normal -> yes); overcast -> yes; rainy leads to a windy test (true -> no, false -> yes)]
A sample data set

    outlook   temperature  humidity  windy  play
 1  sunny     hot          high      false  no
 2  sunny     hot          high      true   no
 3  overcast  hot          high      false  yes
 4  rainy     mild         high      false  yes
 5  rainy     cool         normal    false  yes
 6  rainy     cool         normal    true   no
 7  overcast  cool         normal    true   yes
 8  sunny     mild         high      false  no
 9  sunny     cool         normal    false  yes
10  rainy     mild         normal    false  yes
11  sunny     mild         normal    true   yes
12  overcast  mild         high      true   yes
13  overcast  hot          normal    false  yes
14  rainy     mild         high      true   no

[Quinlan, 1986]
Divide-and-conquer learning strategy
Choose the features that best divide the instance space.
[Figure: histograms of play=yes vs play=no counts for each attribute of the tennis weather task: outlook (overcast, rainy, sunny), temperature (cool, hot, mild), humidity (high, normal), windy (false, true)]
Uses of decision trees in NLP
- Parsing [Haruno et al., 1999, Magerman, 1995]
- Text categorisation [Lewis and Ringuette, 1994]
- Word-sense disambiguation, POS tagging, speech recognition, etc.
What's a Decision Tree?
A decision tree is a graph with:
- internal nodes labelled by terms
- edges labelled by tests (on the weight that the term they depart from has in the document)
- leaves labelled by categories
Given a decision tree T, categorisation of a document d_j is done by recursively testing the weights of the internal nodes of T against those in d_j until a leaf is reached.
Simplest case: d_j consists of Boolean (binary) weights.
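The test-until-leaf procedure can be sketched in Python. This is an illustrative sketch, not the lecture's code: the tuple-based tree encoding and the `categorise` helper are assumptions.

```python
def categorise(node, doc):
    """Walk the tree, testing the document's weight for each internal
    node's term, until a leaf (a category label) is reached."""
    if isinstance(node, str):      # leaf: a category label
        return node
    term, branches = node          # internal node: (term, {weight: subtree})
    return categorise(branches[doc.get(term, 0)], doc)

# Simplest case: Boolean weights (term present = 1, absent = 0)
tree = ("wheat", {1: ("farm", {1: "WHEAT", 0: "NOT-WHEAT"}),
                  0: "NOT-WHEAT"})
```

For example, `categorise(tree, {"wheat": 1, "farm": 1})` follows the 1-branches to the leaf "WHEAT", while a document without "wheat" goes straight to "NOT-WHEAT".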
A Text Categorisation Example
[Figure: decision tree for the Reuters category WHEAT, with internal nodes testing the terms wheat, farm, commodity, bushels, export, agriculture, tonnes, winter, soft, and leaves labelled WHEAT / ¬WHEAT]
The tree is equivalent to the rule:
if (wheat ∧ farm) ∨ (wheat ∧ commodity) ∨ (bushels ∧ export) ∨ (wheat ∧ tonnes) ∨ (wheat ∧ agriculture) ∨ (wheat ∧ soft) then WHEAT else ¬WHEAT
Building Decision Trees
Divide-and-conquer strategy:
1. check whether all d_j have the same label
2. if not, select t_k, partition Tr into classes of documents with the same value for t_k, and place each class under a subtree
3. recur on each subtree until each leaf contains training examples assigned the same category c_i
4. label each leaf with its respective c_i
Some decision tree packages that have been used in TC: ID3, C4.5, C5
Decision tree algorithm

Algorithm 1: Decision tree learning
DTreeLearn(Tr ∈ 2^D, T ∈ 2^T, default ∈ C): tree
  /* Tr is the training set, T is the feature set */
  if isempty(Tr) then
    return default
  else if ∃ c_j s.t. ∀ d_i ∈ Tr, f(d_i, c_j) = 1 then   /* all d_i have class c_j */
    return c_j
  else if isempty(T) then
    return MajorityCateg(Tr)
  else
    t_best ← ChooseFeature(T, Tr)
    tree ← new dtree with root = t_best
    for each v_k ∈ t_best do
      Tr_k ← {d_l ∈ Tr | t_best has value v_k in d_l}
      sbt ← DTreeLearn(Tr_k, T \ {t_best}, MajorityCateg(Tr_k))
      add a branch to tree with label v_k and subtree sbt
    done
    return tree
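Algorithm 1 translates fairly directly into Python. A minimal sketch, assuming documents are dicts of feature values; `ChooseFeature` is left abstract here, as in the algorithm, and is passed in by the caller. Function names are mine, not the lecture's.

```python
from collections import Counter

def majority_categ(examples):
    """Most frequent category among (document, label) pairs."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def dtree_learn(examples, features, default, choose_feature):
    """examples: list of (doc_dict, label); features: set of feature names;
    choose_feature(features, examples) picks the feature to split on."""
    if not examples:                          # empty training set: fall back
        return default
    labels = {label for _, label in examples}
    if len(labels) == 1:                      # all examples share one category
        return labels.pop()
    if not features:                          # no features left: majority vote
        return majority_categ(examples)
    best = choose_feature(features, examples)
    branches = {}
    for value in {doc[best] for doc, _ in examples}:
        subset = [(d, l) for d, l in examples if d[best] == value]
        branches[value] = dtree_learn(subset, features - {best},
                                      majority_categ(subset), choose_feature)
    return (best, branches)                   # internal node: (feature, branches)
```

For a quick experiment one can plug in a trivial chooser such as `lambda fs, ex: min(fs)` (alphabetical order); a proper implementation would use information gain.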
Important Issues
- Choosing the right feature (from T) to partition the training set: choose the feature with the highest information gain
- Avoiding overfitting: memorising all observations from Tr vs. extracting patterns that extrapolate to unseen examples in Tv and Te
- Occam's razor: the most likely hypothesis is the simplest one consistent with all observations
- Inductive bias of DT learning: shorter trees are preferred to larger trees
How do we Implement ChooseFeature?
Finding the right feature to partition the training set is essential. One can use the Information Gain (the difference between the entropy of the mother node and the weighted sum of the entropies of the child nodes) yielded by candidate features T:

  G(T, D) = H(D) − Σ_{t_i ∈ values(T)} p(t_i) H(D_{t_i})    (1)

where H(D) is the information entropy of the category distribution on dataset D; that is, for a random variable C with values c_1, ..., c_|C| and PMF p(c),

  H(D) = − Σ_{j=1}^{|C|} p(c_j) log p(c_j)

The sum over the values of T is called the expected entropy [Quinlan, 1986], and can be written as

  E(T) = − Σ_{t_i ∈ values(T)} p(t_i) Σ_{j=1}^{|C|} p(c_j | t_i) log p(c_j | t_i)
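Equation (1) can be sketched in Python and checked on the Quinlan weather data from the earlier slide. The function names (`entropy`, `info_gain`) are mine; on that data, outlook yields the highest gain (about 0.247), ahead of humidity and windy.

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum_c p(c) log2 p(c) over the category distribution."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())

def info_gain(feature, data):
    """G(T, D) = H(D) - sum_{t_i} p(t_i) H(D_{t_i}) for feature T,
    where data is a list of (doc_dict, label) pairs."""
    labels = [label for _, label in data]
    gain = entropy(labels)
    for value in {doc[feature] for doc, _ in data}:
        subset = [label for doc, label in data if doc[feature] == value]
        gain -= (len(subset) / len(data)) * entropy(subset)
    return gain
```

This is exactly the quantity `ChooseFeature` would maximise over the remaining features at each split.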
Specifically, for Boolean-valued features
For example, the Information Gain yielded by a candidate feature t which can take Boolean values (0 or 1), with respect to a binary categorisation task, is given by:

  G(t, D) = H(D) − [ (|D_t| / |D|) H(D_t) + (|D_¬t| / |D|) H(D_¬t) ]    (2)

where D_t and D_¬t are the subsets of D containing the instances for which t has value 1 and 0, respectively, and

  H(D) = − (|D_c| / |D|) log (|D_c| / |D|) − (|D_¬c| / |D|) log (|D_¬c| / |D|)    (3)

where |D_c| (|D_¬c|) is the number of positive (negative) instances filed under category c in D.
Numeric vector representations
Task: classify REUTERS texts as belonging to category "earnings" (or not).

[...]<title>Cobanco Inc year net</title> <body>shr 34 cts vs 1.19 dlrs Net 807,000 vx 2,858,000 Assets 510.2 mln vs 479 mln Deposits 472 mln vs 440 mln Loans 299.2 mln vs 327 mln Note: 4th qtr not available. Year includes 1985 extraordinary gain from tax carry forward of 132,000 dlrs, or five cts per shr</body>...

T = <vs, mln, cts, ;, &, 000, loss, ,, 3, profit, dlrs, 1, pct, is, s, that, net, lt, at>
d_j = <5, 5, 3, 3, 3, 4, 0, 0, 0, 4, 0, 3, 2, 0, 0, 0, 0, 3, 2, 0>
Creating the text vectors
The feature set T can be selected (reduced) via one of the information-theoretic functions we have seen: document frequency, G, or χ², for example.
We could assign a weight w_ij to each feature as follows [Manning and Schütze, 1999]:

  w_ij = round( 10 × (1 + log #_j(t_i)) / (1 + log Σ_{l=1}^{|T|} #_j(t_l)) )    (4)

where #_j(t_i) is the number of occurrences of term t_i in document d_j (and w_ij = 0 when #_j(t_i) = 0).
Features for partitioning Tr can be selected by discretising the w_ij values (see [Fayyad and Irani, 1993] for a commonly used method) and applying G as shown above.
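Equation (4) can be sketched as follows. This is a hedged reading of the formula; the function name `term_weight` is my assumption.

```python
import math

def term_weight(tf, doc_total):
    """w_ij = round(10 * (1 + log #_j(t_i)) / (1 + log sum_l #_j(t_l))),
    with weight 0 for terms absent from the document (tf = 0)."""
    if tf == 0:
        return 0
    return round(10 * (1 + math.log(tf)) / (1 + math.log(doc_total)))
```

The rounding discretises the log-scaled frequency onto an integer scale from 0 to 10: a term that makes up the whole document gets weight 10, a term occurring once in a 100-token document gets weight 2.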
A DT for the earnings category
[Figure: decision tree with decision boundaries for the "earnings" category:
- node 1 (root): 7681 documents, P(c|n1) = 0.3; split on cts < 2 vs cts >= 2
- node 2: 5977 documents, P(c|n2) = 0.116; split on net < 1 vs net >= 1
- node 3: 5436 documents, P(c|n3) = 0.05
- node 4: 541 documents, P(c|n4) = 0.649
- node 5: 1704 documents, P(c|n5) = 0.943; split on vs < 2 vs vs >= 2
- node 6: 301 documents, P(c|n6) = 0.694
- node 7: 1403 documents, P(c|n7) = 0.996
P(c|n4) is the probability that a document at node n4 belongs to category c = "earnings"]
Calculating node probabilities
One can assign probabilities to a leaf node (i.e. the probability that a new document d ending up at that node should be filed under category c) as follows, using add-one smoothing:

  P(c | d_n) = (|D_cn| + 1) / (|D_cn| + |D_¬cn| + 2)    (5)

where P(c | d_n) is the probability that a document d_n which ended up in node n belongs to category c, and |D_cn| (|D_¬cn|) is the number of training documents in node n which have been assigned category c (¬c).
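Equation (5) is a one-liner in Python (illustrative; the function name is mine):

```python
def node_probability(n_pos, n_neg):
    """Add-one smoothed P(c | d_n) = (|D_cn| + 1) / (|D_cn| + |D_~cn| + 2),
    given the counts of c and not-c training documents in node n."""
    return (n_pos + 1) / (n_pos + n_neg + 2)
```

The smoothing keeps leaf estimates away from 0 and 1: an empty node backs off to 0.5, and a node with 9 positive and 1 negative training documents gets 10/12 rather than 0.9.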
Pruning
Large trees tend to overfit the data. Pruning (i.e. removal of overspecific nodes) often helps produce better models. A commonly used approach:
1. Build a full decision tree
2. For each node of height 1:
   2.1 test for statistical significance with respect to its leaf nodes
   2.2 remove the node if the expected class distribution (given the parent) is not significantly different from the observed class distribution
   2.3 accept the node otherwise
3. Repeat until all nodes of height 1 have been tested
The significance test could be, for example,

  χ² = Σ_k Σ_{i=1}^{|C|} (O_ki − E_ki)² / E_ki

where O_ki is the number of observed instances of category i in partition k (i.e. those for which the test has value v_k; see Algorithm 1) and E_ki is the number of expected instances, e.g.

  E_ki = n_i × (Σ_{j=1}^{|C|} n_kj) / n

where n_kj is the number of instances of category j in partition k, n_i the total number of instances of category i, and n the total number of instances.
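The χ² statistic above can be sketched as follows. The encoding is an assumption: `observed[k][i]` holds O_ki, the count of category i in partition k, and the function name is mine.

```python
def chi_square(observed):
    """Pearson chi-square over a partitions-by-categories count table:
    sum over k, i of (O_ki - E_ki)^2 / E_ki, with E_ki = row_k * col_i / n."""
    rows = [sum(r) for r in observed]          # instances per partition k
    cols = [sum(c) for c in zip(*observed)]    # instances per category i
    n = sum(rows)
    return sum((observed[k][i] - rows[k] * cols[i] / n) ** 2
               / (rows[k] * cols[i] / n)
               for k in range(len(observed))
               for i in range(len(cols))
               if rows[k] and cols[i])         # skip cells with zero expectation
```

If the statistic falls below the critical value (3.84 at the 0.05 level for one degree of freedom, the 2x2 case), the split's class distribution is not significantly different from its parent's and the node is a candidate for pruning.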
The Importance of Pruning
[Figure: comparison of full-size and pruned trees]
Other pruning techniques include minimum description length (MDL) pruning [Mehta et al., 1995], wrapping (incrementally remove nodes and select the tree which gives peak performance on the validation set, by cross-validation), etc. [Mingers, 1989]
A final example: WSD
Consider the following occurrences of the word "bank":
RIV  y be? Then he ran down along the bank, toward a narrow, muddy path.
FIN  four bundles of small notes the bank cashier got it into his head
RIV  ross the bridge and on the other bank you only hear the stream, the
RIV  beneath the house, where a steep bank of earth is compacted between
FIN  op but is really the branch of a bank. As I set foot inside, despite
FIN  raffic police also belong to the bank. More foolhardy than entering
FIN  require a number. If you open a bank account, the teller identifies
RIV  circular movement, skirting the bank of the River Jordan, then turn
The WSD learning task is to learn to distinguish between the meanings financial institution (FIN) and the land alongside a river (RIV).
Task definition
WSD can be described as a categorisation task where:
- the senses (FIN, RIV) are the labels (C)
- the representation of instances (D) comes from the context surrounding the words to be disambiguated
E.g. for T = {along, cashier, stream, muddy, ...}, we could have:
  d_1 = <along = 1, cashier = 0, stream = 0, muddy = 1, ...> and f(d_1) = RIV
Performance can be measured as in text categorisation (precision, recall, etc.)
A decision tree for "bank"
Using the algorithm above (slide 9) we get this decision tree:
[Figure: decision tree with internal nodes testing river, when, from, money, each with branches labelled 1 and 0, and leaves labelled RIV and FIN; e.g. river = 1 leads to RIV]
Trained on a small training set with T = {small, money, on, to, river, from, in, his, accounts, when, by, other, estuary, some, with}
Other topics
- Cost-sensitive classification; see [Lomax and Vadera, 2013] for a comprehensive survey
- Alternative attribute selection criteria: gain ratio, distance-based measures, etc. [Mitchell, 1997]
- Regression
- Missing features
- ...
References I
Fayyad, U. M. and Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022-1026. Morgan Kaufmann.
Haruno, M., Shirai, S., and Ooyama, Y. (1999). Using decision trees to construct a practical parser. Machine Learning, 34(1):131-149.
Lewis, D. and Ringuette, M. (1994). A comparison of two learning algorithms for text categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, pages 81-93.
Lomax, S. and Vadera, S. (2013). A survey of cost-sensitive decision tree induction algorithms. ACM Computing Surveys, 45(2):16:1-16:35.
Magerman, D. M. (1995). Statistical decision-tree models for parsing. In Meeting of the Association for Computational Linguistics, pages 276-283.
Manning, C. D. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.
References II
Mehta, M., Rissanen, J., and Agrawal, R. (1995). MDL-based decision tree pruning. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD'95), pages 216-221.
Mingers, J. (1989). An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4(2):227-243.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1:81-106. doi:10.1007/BF00116251.