Symbolic methods in TC: Decision Trees

ML for NLP
Lecturer: Kevin Koidl
Assistant Lecturer: Alfredo Maldonado
https://www.cs.tcd.ie/kevin.koidl/cs4062/
kevin.koidl@scss.tcd.ie, maldonaa@tcd.ie
2016-2017

2 A symbolic method: decision trees

Symbolic methods offer the advantage that their classification decisions are easily interpretable (by humans).

Decision trees:
- data represented as vectors of discrete-valued (or discretised) attributes
- classification through binary tests on highly discriminative features
- test sequence encoded as a tree structure

[Figure: a decision tree for the weather data. The root tests outlook (sunny / overcast / rainy); the sunny branch tests humidity (high -> no, normal -> yes), the overcast branch is a yes leaf, and the rainy branch tests windy (true -> no, false -> yes).]

3 A sample data set

    outlook   temperature  humidity  windy  play
 1  sunny     hot          high      false  no
 2  sunny     hot          high      true   no
 3  overcast  hot          high      false  yes
 4  rainy     mild         high      false  yes
 5  rainy     cool         normal    false  yes
 6  rainy     cool         normal    true   no
 7  overcast  cool         normal    true   yes
 8  sunny     mild         high      false  no
 9  sunny     cool         normal    false  yes
10  rainy     mild         normal    false  yes
11  sunny     mild         normal    true   yes
12  overcast  mild         high      true   yes
13  overcast  hot          normal    false  yes
14  rainy     mild         high      true   no

[Quinlan, 1986]

4 Divide-and-conquer learning strategy

Choose the features that best divide the instance space. E.g. the distribution of attribute values for the tennis weather task:

[Figure: four bar charts showing the counts of play=yes and play=no for each value of outlook (overcast, rainy, sunny), temperature (cool, hot, mild), humidity (high, normal) and windy (false, true).]
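A minimal sketch in Python of how these per-attribute class distributions can be tabulated (the variable name weather_data and the dict encoding are choices of this sketch; the 14 examples are those of the table above):

from collections import Counter

# The 14 training examples from the table above (Quinlan's weather data).
weather_data = [
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": False, "play": "no"},
    {"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": True,  "play": "no"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "high",   "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": True,  "play": "no"},
    {"outlook": "overcast", "temperature": "cool", "humidity": "normal", "windy": True,  "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "high",   "windy": False, "play": "no"},
    {"outlook": "sunny",    "temperature": "cool", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "sunny",    "temperature": "mild", "humidity": "normal", "windy": True,  "play": "yes"},
    {"outlook": "overcast", "temperature": "mild", "humidity": "high",   "windy": True,  "play": "yes"},
    {"outlook": "overcast", "temperature": "hot",  "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": True,  "play": "no"},
]

# Count (attribute value, class) pairs for each attribute.
for attribute in ["outlook", "temperature", "humidity", "windy"]:
    counts = Counter((d[attribute], d["play"]) for d in weather_data)
    print(attribute, dict(counts))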

Uses of decision trees in NLP

- Parsing [Haruno et al., 1999, Magerman, 1995]
- Text categorisation [Lewis and Ringuette, 1994]
- Word-sense disambiguation, POS tagging, speech recognition, etc.

What's a Decision Tree?

A decision tree is a graph with:
- internal nodes labelled by terms
- edges labelled by tests (on the weight that the term from which they depart has in the document)
- leaves labelled by categories

Given a decision tree T, categorisation of a document d_j is done by recursively testing the weights of the internal nodes of T against those in d_j until a leaf is reached. Simplest case: d_j consists of Boolean (or binary) weights.
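A minimal sketch in Python of this recursive categorisation over Boolean term weights (the DTNode layout and the example tree are illustrative, not taken from the slides):

# A decision tree over Boolean term weights: internal nodes test a term,
# edges correspond to the term's weight (0 or 1), leaves carry a category.

class DTNode:
    def __init__(self, term=None, children=None, category=None):
        self.term = term                 # term tested at this node (None for leaves)
        self.children = children or {}   # maps weight value (0/1) -> DTNode
        self.category = category         # category label (leaves only)

def classify(tree, doc):
    """doc maps terms to Boolean weights, e.g. {"wheat": 1, "farm": 0, ...}."""
    node = tree
    while node.category is None:         # descend until a leaf is reached
        node = node.children[doc.get(node.term, 0)]
    return node.category

# Tiny illustrative tree: test "wheat", then "farm".
tree = DTNode(term="wheat", children={
    0: DTNode(category="NOT WHEAT"),
    1: DTNode(term="farm", children={
        0: DTNode(category="NOT WHEAT"),
        1: DTNode(category="WHEAT"),
    }),
})

print(classify(tree, {"wheat": 1, "farm": 1}))   # -> WHEAT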

7 A Text Categorisation Example

[Figure: a decision tree for the Reuters category WHEAT, with internal nodes testing the terms wheat, farm, commodity, bushels, export, agriculture, tonnes, winter and soft, and leaves labelled WHEAT or ¬WHEAT.]

Equivalent rule:

if (wheat ∧ farm) ∨ (wheat ∧ commodity) ∨ (bushels ∧ export) ∨ (wheat ∧ tonnes) ∨ (wheat ∧ agriculture) ∨ (wheat ∧ soft)
then WHEAT
else ¬WHEAT
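Purely as an illustration (the term pairs are read off the rule above; the function name is made up), the same rule can be written as a Boolean function over binary term indicators:

def is_wheat(d):
    """d maps terms to 0/1 indicators; returns True if the WHEAT rule fires."""
    return bool((d["wheat"] and d["farm"])
                or (d["wheat"] and d["commodity"])
                or (d["bushels"] and d["export"])
                or (d["wheat"] and d["tonnes"])
                or (d["wheat"] and d["agriculture"])
                or (d["wheat"] and d["soft"]))

doc = {"wheat": 1, "farm": 0, "commodity": 0, "bushels": 1, "export": 1,
       "tonnes": 0, "agriculture": 0, "soft": 0}
print(is_wheat(doc))   # -> True (bushels and export both present)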

Building Decision Trees

Divide-and-conquer strategy:
1. check whether all d_j have the same label
2. if not, select t_k, partition Tr into classes of documents with the same value for t_k, and place each class under a subtree
3. recur on each subtree until each leaf contains only training examples assigned the same category c_i
4. label each leaf with its respective c_i

Some decision tree packages that have been used in TC: ID3, C4.5, C5

9 Decision tree algorithm

Algorithm 1: Decision tree learning

DTreeLearn(Tr : 2^D, T : 2^T, default : C): tree
  /* Tr is the training set, T is the feature set */
  if isempty(Tr) then
    return default
  else if ∃ c_j s.t. ∀ d_i ∈ Tr, f(d_i, c_j) = 1 then   /* all d_i have class c_j */
    return c_j
  else if isempty(T) then
    return MajorityCateg(Tr)
  else
    t_best ← ChooseFeature(T, Tr)
    tree ← new dtree with root = t_best
    for each value v_k of t_best do
      Tr_k ← {d_l ∈ Tr | t_best has value v_k in d_l}
      sbt ← DTreeLearn(Tr_k, T \ {t_best}, MajorityCateg(Tr_k))
      add a branch to tree with label v_k and subtree sbt
    done
    return tree
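A minimal runnable sketch of Algorithm 1 in Python (the names dtree_learn, majority_categ and choose_feature mirror the pseudocode; the data format and the placeholder feature-selection rule are assumptions of this sketch, with an information-gain version sketched further below):

from collections import Counter

def majority_categ(examples, label="play"):
    """Most frequent category among the examples."""
    return Counter(d[label] for d in examples).most_common(1)[0][0]

def dtree_learn(examples, features, default, label="play"):
    """Recursive divide-and-conquer tree induction, following Algorithm 1.

    A leaf is represented by a category string; an internal node by a
    (feature, {value: subtree}) pair.
    """
    if not examples:
        return default
    categories = {d[label] for d in examples}
    if len(categories) == 1:                    # all examples share one category
        return categories.pop()
    if not features:
        return majority_categ(examples, label)
    t_best = choose_feature(features, examples, label)
    branches = {}
    for v_k in {d[t_best] for d in examples}:   # one branch per observed value
        tr_k = [d for d in examples if d[t_best] == v_k]
        branches[v_k] = dtree_learn(tr_k, features - {t_best},
                                    majority_categ(tr_k, label), label)
    return (t_best, branches)

def choose_feature(features, examples, label="play"):
    """Placeholder selection rule; an information-gain version is sketched below."""
    return sorted(features)[0]

# Example (reusing the hypothetical weather_data list from the earlier sketch):
# tree = dtree_learn(weather_data, {"outlook", "temperature", "humidity", "windy"}, "yes")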

Important Issues

Choosing the right feature (from T) to partition the training set:
- choose the feature with the highest information gain

Avoiding overfitting:
- memorising all observations from Tr vs. extracting patterns, extrapolating to unseen examples in Tv and Te
- Occam's razor: the most likely hypothesis is the simplest one which is consistent with all observations
- inductive bias of DT learning: shorter trees are preferred to larger trees

How do we Implement ChooseFeature?

Finding the right feature to partition the training set is essential. One can use the Information Gain (the difference between the entropy of the mother node and the weighted sum of the entropies of the child nodes) yielded by each candidate feature T:

G(T, D) = H(D) - \sum_{t_i \in values(T)} p(t_i) H(D_{t_i})    (1)

where H(D) is the information entropy of the category distribution on dataset D, that is, of the random variable C with values c_1, ..., c_{|C|} and PMF p(c). The sum over the values of T is called the expected entropy [Quinlan, 1986], and can be written as

E(T) = - \sum_{t_i \in values(T)} p(t_i) \sum_{j=1}^{|C|} p(c_j \mid t_i) \log p(c_j \mid t_i)
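A minimal sketch of equation (1) in Python (the names entropy and information_gain are mine; the data format matches the hypothetical weather_data list used earlier):

import math
from collections import Counter

def entropy(labels):
    """H of the empirical category distribution of `labels` (in bits)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(examples, feature, label="play"):
    """G(feature, D) = H(D) - sum_v p(v) * H(D_v), as in equation (1)."""
    h_d = entropy([d[label] for d in examples])
    expected = 0.0
    for value in {d[feature] for d in examples}:
        subset = [d[label] for d in examples if d[feature] == value]
        expected += (len(subset) / len(examples)) * entropy(subset)
    return h_d - expected

# Example (reusing the hypothetical weather_data list from the earlier sketch):
# information_gain(weather_data, "outlook")   # roughly 0.247 bits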

12 Specifically, for Boolean-valued features

For example, the Information Gain yielded by a candidate feature t which can take Boolean values (0 or 1), with respect to a binary categorisation task, is given by:

G(t, D) = H(D) - \left[ \frac{|D_t|}{|D|} H(D_t) + \frac{|D_{\bar{t}}|}{|D|} H(D_{\bar{t}}) \right]    (2)

where D_t and D_{\bar{t}} are the subsets of D containing the instances for which t has value 1 and 0, respectively, and

H(D) = - \frac{|D_c|}{|D|} \log \frac{|D_c|}{|D|} - \frac{|D_{\bar{c}}|}{|D|} \log \frac{|D_{\bar{c}}|}{|D|}    (3)

where |D_c| (|D_{\bar{c}}|) is the number of positive (negative) instances filed under category c in D.
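As a worked toy example (the numbers are invented for illustration and are not from the slides): suppose |D| = 14 with 9 positive and 5 negative instances, and a Boolean feature t with value 1 in exactly 4 instances, all of them positive. Then

H(D) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} \approx 0.940

H(D_t) = 0, \quad H(D_{\bar{t}}) = -\frac{5}{10}\log_2\frac{5}{10} - \frac{5}{10}\log_2\frac{5}{10} = 1

G(t, D) \approx 0.940 - \left[ \frac{4}{14} \cdot 0 + \frac{10}{14} \cdot 1 \right] \approx 0.226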

Numeric vector representations

Task: classify REUTERS texts as belonging to category earnings (or not).

[...]<title>Cobanco Inc year net</title>
<body>shr 34 cts vs 1.19 dlrs
Net 807,000 vs 2,858,000
Assets 510.2 mln vs 479 mln
Deposits 472 mln vs 440 mln
Loans 299.2 mln vs 327 mln
Note: 4th qtr not available. Year includes 1985 extraordinary gain from tax carry forward of 132,000 dlrs, or five cts per shr</body>...

T = ⟨vs, mln, cts, ;, &, 000, loss, ,, 3, profit, dlrs, 1, pct, is, s, that, net, lt, at⟩
d_j = ⟨5, 5, 3, 3, 3, 4, 0, 0, 0, 4, 0, 3, 2, 0, 0, 0, 0, 3, 2, 0⟩

Creating the text vectors

The feature set T can be selected (reduced) via one of the information-theoretic functions we have seen: document frequency, G, or χ², for example.

We could assign a weight w_ij to each feature as follows [Manning and Schütze, 1999]:

w_{ij} = round\left( 10 \cdot \frac{1 + \log \#_j(t_i)}{1 + \log \sum_{l=1}^{|T|} \#_j(t_l)} \right)    (4)

where \#_j(t_i) is the number of occurrences of term t_i in document d_j.

Features for partitioning Tr can then be selected by discretising the w_ij values (see [Fayyad and Irani, 1993] for a commonly used method) and applying G as shown above.
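A minimal sketch of equation (4) in Python (the function name term_weight, the use of the natural logarithm and the mapping of zero counts to weight 0 are assumptions of this sketch):

import math

def term_weight(count, doc_counts):
    """Weight of a term in a document, following equation (4).

    count      -- occurrences of the term in the document, #_j(t_i)
    doc_counts -- occurrence counts of all terms in T for that document, #_j(t_l)
    A zero count is mapped to weight 0 (log 0 is undefined); math.log is the
    natural log, as the base is not specified on the slide.
    """
    if count == 0:
        return 0
    total = sum(doc_counts)
    return round(10 * (1 + math.log(count)) / (1 + math.log(total)))

# Toy usage: a term occurring 5 times in a document whose terms occur 60 times in total.
print(term_weight(5, [5, 5, 3, 3, 3, 4, 4, 3, 2, 3, 2] + [1] * 23))   # -> 5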

15 A DT for the earnings category

[Figure: a decision tree for the category c = "earnings", shown with a plot of the decision boundaries in the cts/net plane:

Node 1 (root): 7681 documents, P(c|n1) = 0.3; split on cts < 2 vs cts >= 2
  Node 2 (cts < 2): 5977 documents, P(c|n2) = 0.116; split on net < 1 vs net >= 1
    Node 3 (net < 1): 5436 documents, P(c|n3) = 0.05
    Node 4 (net >= 1): 541 documents, P(c|n4) = 0.649
  Node 5 (cts >= 2): 1704 documents, P(c|n5) = 0.943; split on vs < 2 vs vs >= 2
    Node 6 (vs < 2): 301 documents, P(c|n6) = 0.694
    Node 7 (vs >= 2): 1403 documents, P(c|n7) = 0.996

P(c|n4) is the probability that a document at node n4 belongs to category c = "earnings".]

16 Calculating node probabilities

One can assign probabilities to a leaf node (i.e. the probability that a new document d ending up in that node should be filed under category c) as follows, using add-one smoothing:

P(c \mid d_n) = \frac{|D_c^n| + 1}{|D_c^n| + |D_{\bar{c}}^n| + 1 + 1}    (5)

where P(c | d_n) is the probability that a document d_n which ended up in node n belongs to category c, and |D_c^n| (|D_{\bar{c}}^n|) is the number of (training) documents in node n which have been assigned category c (\bar{c}).
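For instance (hypothetical counts, not from the slides): a leaf containing 350 training documents of category c and 190 of \bar{c} gets P(c \mid d_n) = (350 + 1)/(350 + 190 + 2) = 351/542 \approx 0.65, while a leaf with no training documents gets 1/2, which is the point of the smoothing.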

17 Pruning

Large trees tend to overfit the data. Pruning (i.e. removal of overspecific nodes) often helps produce better models. A commonly used approach:

1. Build a full decision tree
2. For each node of height 1:
   2.1 test for statistical significance with respect to its leaf nodes
   2.2 remove the node if the expected class distribution (given the parent) is not significantly different from the observed class distribution
   2.3 accept the node otherwise
3. Repeat until all nodes of height 1 have been tested

The significance test could be, for example,

\chi^2 = \sum_k \sum_i^{|C|} \frac{(O_{ki} - E_{ki})^2}{E_{ki}}

where O_{ki} is the number of observed instances of category i in partition k (i.e. those for which the test has value v_k; see Algorithm 1) and E_{ki} is the number of expected instances, e.g.

E_{ki} = n_i \frac{\sum_j^{|C|} n_{kj}}{\sum_j^{|C|} n_j}
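A minimal sketch of this test in Python (the function name and the example counts are made up; 3.84 is the χ² critical value for one degree of freedom at the 0.05 level):

def chi_square(observed):
    """Chi-square statistic for a node's split.

    observed[k][i] = number of training instances of category i that fall
    into partition k of the node being tested (the O_ki above).
    """
    n = sum(sum(row) for row in observed)
    category_totals = [sum(col) for col in zip(*observed)]   # n_i
    partition_totals = [sum(row) for row in observed]        # n_k
    chi2 = 0.0
    for k, row in enumerate(observed):
        for i, o_ki in enumerate(row):
            e_ki = category_totals[i] * partition_totals[k] / n
            chi2 += (o_ki - e_ki) ** 2 / e_ki
    return chi2

# Two partitions x two categories; compare against a chi-square critical value
# (e.g. 3.84 for 1 degree of freedom at the 0.05 level) to decide whether to prune.
print(chi_square([[20, 5], [4, 11]]))   # roughly 11.1, so keep this split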

The Importance of Pruning

[Figure: comparison of full-size and pruned trees.]

Other pruning techniques include minimum description length (MDL) pruning [Mehta et al., 1995], wrapping (incrementally remove nodes and select the tree which gives peak performance on the validation set, by cross-validation), etc. [Mingers, 1989].

A final example: WSD

Consider the following occurrences of the word "bank":

RIV  y be? Then he ran down along the bank, toward a narrow, muddy path.
FIN  four bundles of small notes the bank cashier got it into his head
RIV  ross the bridge and on the other bank you only hear the stream, the
RIV  beneath the house, where a steep bank of earth is compacted between
FIN  op but is really the branch of a bank. As I set foot inside, despite
FIN  raffic police also belong to the bank. More foolhardy than entering
FIN  require a number. If you open a bank account, the teller identifies
RIV  circular movement, skirting the bank of the River Jordan, then turn

The WSD learning task is to learn to distinguish between the meanings "financial institution" (FIN) and "the land alongside a river" (RIV).

Task definition

WSD can be described as a categorisation task where:
- the senses (FIN, RIV) are the labels (C)
- the representation of instances (D) comes from the context surrounding the words to be disambiguated

E.g., for T = {along, cashier, stream, muddy, ...}, we could have

d_1 = ⟨along = 1, cashier = 0, stream = 0, muddy = 1, ...⟩ and f(d_1) = RIV

Performance can be measured as in text categorisation (precision, recall, etc.).
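A minimal sketch in Python (the tokenisation and names are simplifications of mine) of turning a context into such a binary feature vector:

def context_vector(context, feature_terms):
    """Binary bag-of-words representation of a disambiguation context."""
    tokens = {t.strip(".,?!").lower() for t in context.split()}
    return {term: int(term in tokens) for term in feature_terms}

T = ["along", "cashier", "stream", "muddy"]
d1 = context_vector("Then he ran down along the bank, toward a narrow, muddy path.", T)
print(d1)   # -> {'along': 1, 'cashier': 0, 'stream': 0, 'muddy': 1}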

A decision tree

Using the algorithm above (slide 9) we get this decision tree:

[Figure: the learned tree for "bank" tests the features on, river, when, from and money (each with branches for values 1 and 0), with leaves labelled RIV and FIN.]

Trained on a small training set with T = {small, money, on, to, river, from, in, his, accounts, when, by, other, estuary, some, with}.

Other topics

- Cost-sensitive classification; see [Lomax and Vadera, 2013] for a comprehensive survey
- Alternative attribute selection criteria: gain ratio, distance-based measures, etc. [Mitchell, 1997]
- Regression
- Missing features
- ...

23 References I

Fayyad, U. M. and Irani, K. B. (1993). Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022–1026. Morgan Kaufmann.

Haruno, M., Shirai, S., and Ooyama, Y. (1999). Using decision trees to construct a practical parser. Machine Learning, 34(1):131–149.

Lewis, D. and Ringuette, M. (1994). A comparison of two learning algorithms for text categorization. In Third Annual Symposium on Document Analysis and Information Retrieval, pages 81–93.

Lomax, S. and Vadera, S. (2013). A survey of cost-sensitive decision tree induction algorithms. ACM Computing Surveys, 45(2):16:1–16:35.

Magerman, D. M. (1995). Statistical decision-tree models for parsing. In Meeting of the Association for Computational Linguistics, pages 276–283.

Manning, C. D. and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts.

References II

Mehta, M., Rissanen, J., and Agrawal, R. (1995). MDL-based decision tree pruning. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD'95), pages 216–221.

Mingers, J. (1989). An empirical comparison of pruning methods for decision tree induction. Machine Learning, 4(2):227–243.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1:81–106. doi:10.1007/BF00116251.