Symbolic methods in TC: Decision Trees
ML for NLP
Lecturer: Kevin Koidl; Assist. Lecturer: Alfredo Maldonado
https://www.cs.tcd.ie/kevin.koidl/cs0/
kevin.koidl@scss.tcd.ie, maldonaa@tcd.ie (01-017)

A symbolic method: decision trees

Symbolic methods offer the advantage that their classification decisions are easily interpretable (by humans). Decision trees:
- represent data as vectors of discrete-valued (or discretised) attributes;
- classify through binary tests on highly discriminative features;
- encode the test sequence as a tree structure.

[Figure: a decision tree for the weather task. Root: outlook. outlook = sunny -> test humidity (high -> no, normal -> yes); outlook = overcast -> yes; outlook = rainy -> test windy (true -> no, false -> yes).]

A sample data set

      outlook    temperature  humidity  windy  play
  1   sunny      hot          high      false  no
  2   sunny      hot          high      true   no
  3   overcast   hot          high      false  yes
  4   rainy      mild         high      false  yes
  5   rainy      cool         normal    false  yes
  6   rainy      cool         normal    true   no
  7   overcast   cool         normal    true   yes
  8   sunny      mild         high      false  no
  9   sunny      cool         normal    false  yes
 10   rainy      mild         normal    false  yes
 11   sunny      mild         normal    true   yes
 12   overcast   mild         high      true   yes
 13   overcast   hot          normal    false  yes
 14   rainy      mild         high      true   no

(?)
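The weather data above can be encoded directly as attribute-value records. A minimal sketch (the attribute and value names follow the table; the helper name `class_counts_by_value` is my own):

```python
from collections import Counter

# The 14 weather instances from the table above, as (attributes, label) pairs.
DATA = [
    ({"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": False}, "no"),
    ({"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": True},  "no"),
    ({"outlook": "overcast", "temperature": "hot",  "humidity": "high",   "windy": False}, "yes"),
    ({"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": False}, "yes"),
    ({"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": False}, "yes"),
    ({"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": True},  "no"),
    ({"outlook": "overcast", "temperature": "cool", "humidity": "normal", "windy": True},  "yes"),
    ({"outlook": "sunny",    "temperature": "mild", "humidity": "high",   "windy": False}, "no"),
    ({"outlook": "sunny",    "temperature": "cool", "humidity": "normal", "windy": False}, "yes"),
    ({"outlook": "rainy",    "temperature": "mild", "humidity": "normal", "windy": False}, "yes"),
    ({"outlook": "sunny",    "temperature": "mild", "humidity": "normal", "windy": True},  "yes"),
    ({"outlook": "overcast", "temperature": "mild", "humidity": "high",   "windy": True},  "yes"),
    ({"outlook": "overcast", "temperature": "hot",  "humidity": "normal", "windy": False}, "yes"),
    ({"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": True},  "no"),
]

def class_counts_by_value(data, attribute):
    """Count play=yes / play=no per value of the given attribute."""
    counts = {}
    for attrs, label in data:
        counts.setdefault(attrs[attribute], Counter())[label] += 1
    return counts
```

For example, `class_counts_by_value(DATA, "outlook")` recovers the per-value class distributions used on the next slide (all four overcast days are play=yes).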
Divide-and-conquer learning strategy

Choose the features that best divide the instance space. E.g., the distribution of attribute values for the tennis weather task:

[Figure: for each attribute (outlook, temperature, humidity, windy), bar charts of play=yes vs. play=no counts per attribute value (sunny/overcast/rainy, cool/hot/mild, high/normal, false/true).]

Uses of decision trees in NLP: parsing (??), text categorisation (?), word-sense disambiguation, POS tagging, speech recognition, etc.

What is a Decision Tree?

A decision tree is a graph with:
- internal nodes labelled by terms;
- edges labelled by tests (on the weight the term from which they depart has in the document);
- leaves labelled by categories.

Given a decision tree T, categorisation of a document d_j is done by recursively testing the weights of the internal nodes of T against those in d_j until a leaf is reached. Simplest case: d_j consists of Boolean (or binary) weights.
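The recursive test-until-leaf procedure can be sketched with a tree encoded as nested dicts (this encoding, and the function name `categorise`, are my own choices for illustration):

```python
# A decision tree as nested dicts: an internal node is
# {"term": t, "branches": {value: subtree, ...}}; a leaf is a category label.
def categorise(tree, doc):
    """Recursively test the document's value for each internal node's term
    until a leaf (category label) is reached."""
    while isinstance(tree, dict):
        value = doc.get(tree["term"], 0)   # Boolean/binary weight, or attribute value
        tree = tree["branches"][value]
    return tree

# The weather tree from the first slide, with attribute values as edge labels.
weather_tree = {"term": "outlook", "branches": {
    "sunny":    {"term": "humidity", "branches": {"high": "no", "normal": "yes"}},
    "overcast": "yes",
    "rainy":    {"term": "windy", "branches": {True: "no", False: "yes"}},
}}
```

E.g. a sunny, high-humidity day is categorised "no" after two tests.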
A Text Categorisation Example

[Figure: a decision tree for the category WHEAT, with internal nodes testing the terms wheat, farm, commodity, bushels, agriculture, export, tonnes, winter and soft.] The tree encodes the rule:

if (wheat & farm) | (wheat & commodity) | (bushels & export) | (wheat & tonnes) | (wheat & agriculture) | (wheat & soft) then WHEAT else not WHEAT

Building Decision Trees

Divide-and-conquer strategy:
1. check whether all d_j have the same label;
2. if not, select t_k, partition Tr into classes of documents with the same value for t_k, and place each class under a subtree;
3. recur on each subtree until each leaf contains training examples assigned the same category c_i;
4. label each leaf with its respective c_i.

Some decision tree packages that have been used in TC: ID3, C4.5, C5.0.

Decision tree algorithm

Algorithm 1: Decision tree learning

  DTreeLearn(Tr, T, default):        /* Tr is the training set, T is the feature set */
      if isempty(Tr) then
          return default
      else if there is a c_j s.t. f(d_i, c_j) = 1 for all d_i in Tr then
          return c_j                 /* all d_i have class c_j */
      else if isempty(T) then
          return MajorityCateg(Tr)
      else
          t_best <- ChooseFeature(T, Tr)
          tree <- new dtree with root t_best
          for each value v_k of t_best do
              Tr_k <- { d_l in Tr : t_best has value v_k in d_l }
              sbt <- DTreeLearn(Tr_k, T \ {t_best}, MajorityCateg(Tr_k))
              add a branch to tree with label v_k and subtree sbt
          done
          return tree
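The pseudocode translates almost line for line into Python. A sketch, assuming discrete feature values; `choose_feature` is passed in as a parameter (information gain, discussed next, is the usual choice), and the loop runs over the feature values observed in the data so no empty partitions arise:

```python
from collections import Counter

def majority_categ(examples):
    """MajorityCateg: the most frequent label in the training subset."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def dtree_learn(examples, features, default, choose_feature):
    """DTreeLearn (Algorithm 1). `examples` is a list of (attrs, label) pairs,
    `features` the set of attribute names still available for splitting."""
    if not examples:
        return default
    labels = {label for _, label in examples}
    if len(labels) == 1:                       # all d_i have the same class c_j
        return labels.pop()
    if not features:
        return majority_categ(examples)
    t_best = choose_feature(features, examples)
    tree = {"term": t_best, "branches": {}}
    for v_k in {attrs[t_best] for attrs, _ in examples}:
        subset = [(a, l) for a, l in examples if a[t_best] == v_k]
        tree["branches"][v_k] = dtree_learn(
            subset, features - {t_best}, majority_categ(subset), choose_feature)
    return tree
```

The returned structure is the nested-dict tree representation used earlier: internal nodes carry the chosen term, branches carry the feature values.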
Important Issues

Choosing the right feature (from T) to partition the training set:
- choose the feature with the highest information gain.

Avoiding overfitting:
- memorising all observations from Tr vs. extracting patterns that extrapolate to unseen examples in Tv and Te;
- Occam's razor: the most likely hypothesis is the simplest one which is consistent with all observations;
- inductive bias of DT learning: shorter trees are preferred to larger trees.

How do we implement ChooseFeature?

Finding the right feature to partition the training set is essential. One can use the Information Gain (the difference between the entropy of the mother node and the weighted sum of the entropies of the child nodes) yielded by candidate features T:

  G(T, D) = H(D) - Σ_{t_i ∈ values(T)} p(t_i) H(D_{t_i})        (1)

where H(D) is the information entropy of the category distribution on dataset D, that is, for a random variable C with values c_1, ..., c_|C| and PMF p(c), H(D) = -Σ_j p(c_j) log p(c_j). The sum over the values of T is called the expected entropy (?), and can be written as

  E(T) = - Σ_{t_i ∈ values(T)} p(t_i) Σ_{j=1}^{|C|} p(c_j | t_i) log p(c_j | t_i)

Recall that entropy (the H(.) function above), AKA self-information, measures the amount of uncertainty w.r.t. a probability distribution. In other words, entropy is a measure of how much we learn when we observe an event occurring in accordance with this distribution.

For example, the Information Gain yielded by a candidate feature t which can take Boolean values (0 or 1) with respect to a binary categorisation task is given by:

  G(t, D) = H(D) - [ (|D_t|/|D|) H(D_t) + (|D_¬t|/|D|) H(D_¬t) ]        (2)

where D_t and D_¬t are the subsets of D containing the instances for which t has value 1 and 0, respectively, and

  H(D) = -(|D_c|/|D|) log(|D_c|/|D|) - (|D_¬c|/|D|) log(|D_¬c|/|D|)        (3)

where |D_c| (|D_¬c|) is the number of positive (negative) instances filed under category c in D.
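The information gain computation can be checked numerically on the weather data (base-2 logarithms; the function names are mine). Splitting on outlook yields the well-known gain of about 0.247 bits:

```python
import math
from collections import Counter

def entropy(labels):
    """H(D): entropy (base 2) of the category distribution over `labels`."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(data, attribute):
    """G(T, D) = H(D) - sum over the attribute's values t_i of p(t_i) H(D_{t_i}).
    `data` is a list of (attrs, label) pairs."""
    labels = [label for _, label in data]
    by_value = {}
    for attrs, label in data:
        by_value.setdefault(attrs[attribute], []).append(label)
    expected = sum(len(ls) / len(data) * entropy(ls) for ls in by_value.values())
    return entropy(labels) - expected
```

This `info_gain` is a direct implementation of equation (1) and is what one would plug in as ChooseFeature's scoring function (pick the feature maximising it).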
Numeric vector representations

Task: classify REUTERS texts as belonging to category "earnings" (or not).

[...]<title>Cobanco Inc year net</title> <body>Shr 3 cts vs 1.19 dlrs. Net 07,000 vs 5,000. Assets 510. mln vs 79 mln. Deposits 7 mln vs 0 mln. Loans 99. mln vs 37 mln. Note: 4th qtr not available. Year includes 1985 extraordinary gain from tax carry forward of 13,000 dlrs, or five cts per shr</body>...

  T   = <vs, mln, cts, ;, &, 000, loss, ,, 3, profit, dlrs, 1, pct, is, s, that, net, lt, at>
  d_j = <5, 5, 3, 3, 3, , 0, 0, 0, , 0, 3, , 0, 0, 0, 0, 3, , 0>

Creating the text vectors

The feature set, T, can be selected (reduced) via one of the information-theoretic functions we have seen: document frequency, G, or χ², for example. We could assign a weight w_ij to each feature as follows (?):

  w_ij = round( 10 × (1 + log tf_ij) / (1 + log l_j) )        (4)

if tf_ij > 0, and w_ij = 0 otherwise, where tf_ij is the number of occurrences of term t_i in document d_j and l_j is the length of d_j. Features for partitioning Tr can be selected by discretising the w_ij values (see (?) for a commonly used method) and applying G as shown above.

A DT for the earnings category

[Figure: a decision tree for the "earnings" category, with decision boundaries on the discretised term weights. The root node n1 splits on the weight of cts; node n2 (5977 documents, P(c|n2) = 0.11) splits on net (net < 1 vs. net >= 1); node n5 (P(c|n5) = 0.93) splits on vs; the leaves include n3 (P(c|n3) = 0.05) and n7 (P(c|n7) = 0.99). P(c|n) is the probability that a document at node n belongs to category c = "earnings".]

Calculating node probabilities

One can assign probabilities to a leaf node (i.e. the probability that a new document d belonging to that node should be filed under category c) as follows (using add-one smoothing):

  P(c | d, n) = (|D_cn| + 1) / (|D_cn| + |D_¬cn| + 2)        (5)
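Both the term weighting and the smoothed leaf probability are one-liners. A sketch, assuming the round(10·(1 + log tf)/(1 + log l)) weighting described above (function names are mine):

```python
import math

def term_weight(tf, doc_len):
    """w_ij = round(10 * (1 + log(tf_ij)) / (1 + log(l_j))) for tf > 0, else 0.
    tf: occurrences of the term in the document; doc_len: document length l_j."""
    if tf == 0:
        return 0
    return round(10 * (1 + math.log(tf)) / (1 + math.log(doc_len)))

def leaf_probability(n_c, n_not_c):
    """Add-one smoothed P(c | d, n) = (|D_cn| + 1) / (|D_cn| + |D_~cn| + 2).
    n_c / n_not_c: training documents in node n labelled c / not-c."""
    return (n_c + 1) / (n_c + n_not_c + 2)
```

Note that smoothing keeps the leaf probabilities strictly between 0 and 1: an empty leaf gets 0.5 rather than an undefined ratio, and a pure leaf never claims certainty.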
where P(c | d, n) is the probability that a document d which ended up in node n belongs to category c, and |D_cn| (|D_¬cn|) is the number of training documents in node n which have been assigned category c (¬c).

Pruning

Large trees tend to overfit the data. Pruning (i.e. removal of overspecific nodes) often helps produce better models. A commonly used approach:
1. build a full decision tree;
2. for each node of height 1:
   (a) test for statistical significance w.r.t. its leaf nodes;
   (b) remove the node if the expected class distribution (given the parent) is not significantly different from the observed class distribution;
   (c) accept the node otherwise;
3. repeat until all nodes of height 1 have been tested.

The significance test could be, for example,

  χ² = Σ_k Σ_{i=1}^{|C|} (O_ki - E_ki)² / E_ki

where O_ki is the number of observed instances of category i in partition k (i.e. those for which the test has value v_k; see algorithm 1) and E_ki the number of expected instances, e.g.

  E_ki = n_i × (Σ_{j=1}^{|C|} n_kj) / (Σ_{j=1}^{|C|} n_j)

where n_kj is the number of instances of category j in partition k and n_j the number of instances of category j in the parent node. For instance, the expected number of instances of category c in a binary categorisation task in partition k would be

  E_kc = n_c (n_kc + n_k¬c) / (n_c + n_¬c)

where n_c is the number of instances of category c in the parent node and n_kc, n_k¬c the numbers of instances of category c and not category c, respectively, in leaf k.

The Importance of Pruning

Comparing full-size and pruned trees:

[Figure: performance comparison of full-size and pruned trees.]
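The χ² statistic above is computed from the observed partition-by-category counts alone, with the expected counts derived from the marginals. A sketch (the table layout, rows = partitions and columns = categories, is my own choice):

```python
def chi_squared(observed):
    """chi^2 = sum_k sum_i (O_ki - E_ki)^2 / E_ki, where
    E_ki = (row total k) * (column total i) / grand total.
    `observed` is a list of rows: observed[k][i] = O_ki."""
    rows = [sum(r) for r in observed]         # instances per partition k
    cols = [sum(c) for c in zip(*observed)]   # instances per category i (parent node)
    n = sum(rows)
    chi2 = 0.0
    for k, row in enumerate(observed):
        for i, o in enumerate(row):
            e = rows[k] * cols[i] / n         # expected count E_ki
            chi2 += (o - e) ** 2 / e
    return chi2
```

A split that merely mirrors the parent distribution scores 0 (prune it); a sharply different split scores high (keep it), to be compared against a χ² critical value at the chosen significance level.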
Other pruning techniques include minimum description length (MDL) pruning (?), wrapping (incrementally remove nodes, selecting the tree which gives peak performance on the validation set by cross-validation), etc. (?).

A final example: WSD

Consider the following occurrences of the word "bank":

  RIV  ...y be? Then he ran down along the bank, toward a narrow, muddy path.
  FIN  ...four bundles of small notes the bank cashier got it into his head
  RIV  ...ross the bridge and on the other bank you only hear the stream, the
  RIV  ...beneath the house, where a steep bank of earth is compacted between
  FIN  ...op but is really the branch of a bank. As I set foot inside, despite
  FIN  ...raffic police also belong to the bank. More foolhardy than entering
  FIN  ...require a number. If you open a bank account, the teller identifies
  RIV  ...circular movement, skirting the bank of the River Jordan, then turn

The WSD learning task is to learn to distinguish between the meanings "financial institution" (FIN) and "the land alongside a river" (RIV).

Task definition

WSD can be described as a categorisation task where:
- senses (FIN, RIV) are the labels (C);
- the representation of instances (D) comes from the context surrounding the words to be disambiguated.

E.g., for T = {along, cashier, stream, muddy, ...}, we could have:

  d_1 = <along = 1, cashier = 0, stream = 0, muddy = 1, ...> and f(d_1) = RIV

Performance can be measured as in text categorisation (precision, recall, etc.)
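Turning a concordance line like the ones above into an instance d_i is straightforward. A sketch (the naive whitespace tokeniser and the tiny feature set are simplifications of my own):

```python
def context_vector(context, feature_words):
    """Binary feature vector over the context: 1 iff the feature word occurs."""
    tokens = {tok.strip(".,?!").lower() for tok in context.split()}
    return {w: int(w in tokens) for w in feature_words}

T = ["along", "cashier", "stream", "muddy"]
d1 = context_vector("Then he ran down along the bank, toward a narrow, muddy path.", T)
# d1 is the instance for the first RIV line above
```

These vectors, paired with their FIN/RIV labels, feed directly into the decision tree learner from Algorithm 1.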
A decision tree for "bank"

Using the algorithm above (Algorithm 1) we get the following decision tree:

[Figure: a decision tree for disambiguating "bank". The root tests river; subsequent nodes test when, from and money; leaves are labelled RIV and FIN.]

It was trained on a small training set with T = {small, money, on, to, river, from, in, his, accounts, when, by, other, estuary, some, with}.

Other topics

- Cost-sensitive classification; see (?) for a comprehensive survey
- Alternative attribute selection criteria: gain ratio, distance-based measures, etc. (?)
- Regression
- Missing features
- ...