Symbolic methods in TC: Decision Trees

ML for NLP
Lecturer: Kevin Koidl
Assistant Lecturer: Alfredo Maldonado
https://www.cs.tcd.ie/kevin.koidl/cs0/
kevin.koidl@scss.tcd.ie, maldonaa@tcd.ie
2016-2017

A symbolic method: decision trees

Symbolic methods offer the advantage that their classification decisions are easily interpretable (by humans).

Decision trees:
- data are represented as vectors of discrete-valued (or discretised) attributes;
- classification proceeds through binary tests on highly discriminative features;
- the test sequence is encoded as a tree structure.

[Figure: decision tree for the weather data. The root tests outlook: sunny leads to a humidity test (high → no, normal → yes), overcast leads directly to yes, and rainy leads to a windy test (true → no, false → yes).]

A sample data set

     outlook   temperature  humidity  windy  play
 1   sunny     hot          high      false  no
 2   sunny     hot          high      true   no
 3   overcast  hot          high      false  yes
 4   rainy     mild         high      false  yes
 5   rainy     cool         normal    false  yes
 6   rainy     cool         normal    true   no
 7   overcast  cool         normal    true   yes
 8   sunny     mild         high      false  no
 9   sunny     cool         normal    false  yes
10   rainy     mild         normal    false  yes
11   sunny     mild         normal    true   yes
12   overcast  mild         high      true   yes
13   overcast  hot          normal    false  yes
14   rainy     mild         high      true   no

(?)
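Purely as an illustration (variable and function names below are mine, not from the slides), the table can be written down as a small in-memory dataset and the tree in the figure hand-coded as a classifier in Python:

# The weather data from the table above, as (attributes, label) pairs.
weather_data = [
    ({"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": False}, "no"),
    ({"outlook": "sunny",    "temperature": "hot",  "humidity": "high",   "windy": True},  "no"),
    ({"outlook": "overcast", "temperature": "hot",  "humidity": "high",   "windy": False}, "yes"),
    ({"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": False}, "yes"),
    ({"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": False}, "yes"),
    ({"outlook": "rainy",    "temperature": "cool", "humidity": "normal", "windy": True},  "no"),
    ({"outlook": "overcast", "temperature": "cool", "humidity": "normal", "windy": True},  "yes"),
    ({"outlook": "sunny",    "temperature": "mild", "humidity": "high",   "windy": False}, "no"),
    ({"outlook": "sunny",    "temperature": "cool", "humidity": "normal", "windy": False}, "yes"),
    ({"outlook": "rainy",    "temperature": "mild", "humidity": "normal", "windy": False}, "yes"),
    ({"outlook": "sunny",    "temperature": "mild", "humidity": "normal", "windy": True},  "yes"),
    ({"outlook": "overcast", "temperature": "mild", "humidity": "high",   "windy": True},  "yes"),
    ({"outlook": "overcast", "temperature": "hot",  "humidity": "normal", "windy": False}, "yes"),
    ({"outlook": "rainy",    "temperature": "mild", "humidity": "high",   "windy": True},  "no"),
]

def classify_weather(x):
    """Hand-coded version of the tree in the figure: test outlook first,
    then humidity on the sunny branch or windy on the rainy branch."""
    if x["outlook"] == "overcast":
        return "yes"
    if x["outlook"] == "sunny":
        return "yes" if x["humidity"] == "normal" else "no"
    return "no" if x["windy"] else "yes"   # rainy branch

# The tree classifies every training example correctly.
assert all(classify_weather(x) == y for x, y in weather_data)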

Divide-and-conquer learning strategy

Choose the features that best divide the instance space. E.g. the distribution of attribute values for the tennis weather task:

[Figure: bar charts showing, for each attribute (outlook, temperature, humidity, windy), the counts of play=yes and play=no instances per attribute value.]

Uses of decision trees in NLP

- Parsing (??)
- Text categorisation (?)
- Word-sense disambiguation, POS tagging, speech recognition, etc.

What's a Decision Tree?

A decision tree is a graph with:
- internal nodes labelled by terms,
- edges labelled by tests (on the weight that the term they depart from has in the document),
- leaves labelled by categories.

Given a decision tree T, categorisation of a document d_j is done by recursively testing the weights of the internal nodes of T against those in d_j until a leaf is reached. Simplest case: d_j consists of Boolean (or binary) weights...
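To make the definition concrete, here is a minimal sketch (the node contents are invented, purely for illustration) of a decision tree over Boolean term weights as a nested structure, with categorisation by recursive testing until a leaf is reached:

# An internal node is a tuple (term, subtree_if_term_absent, subtree_if_term_present);
# a leaf is simply a category label.
tree = ("wheat",
        "NOT-WHEAT",
        ("farm",
         ("export", "NOT-WHEAT", "WHEAT"),
         "WHEAT"))

def categorise(node, doc_terms):
    """Recursively test the document's (Boolean) term weights until a leaf."""
    if isinstance(node, str):          # leaf: a category label
        return node
    term, if_absent, if_present = node
    return categorise(if_present if term in doc_terms else if_absent, doc_terms)

print(categorise(tree, {"wheat", "farm"}))      # WHEAT
print(categorise(tree, {"bushels", "export"}))  # NOT-WHEAT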

A Text Categorisation Example

[Figure: decision tree for the Reuters category WHEAT, with internal nodes testing the terms wheat, farm, commodity, bushels, agriculture, export, tonnes, winter and soft. The tree is equivalent to the rule:
if (wheat ∧ farm) ∨ (wheat ∧ commodity) ∨ (bushels ∧ export) ∨ (wheat ∧ tonnes) ∨ (wheat ∧ agriculture) ∨ (wheat ∧ soft) then WHEAT else ¬WHEAT]

Building Decision Trees

Divide-and-conquer strategy:
1. check whether all d_j have the same label;
2. if not, select t_k, partition Tr into classes of documents with the same value for t_k, and place each class under a subtree;
3. recur on each subtree until each leaf contains training examples assigned with the same category c_i;
4. label each leaf with its respective c_i.

Some decision tree packages that have been used in TC: ID3, C4.5, C5.

Decision tree algorithm

Algorithm 1: Decision tree learning

DTreeLearn(Tr: D, T: T, default: C): tree
    /* Tr is the training set, T is the feature set */
    if isempty(Tr) then
        return default
    else if ∃ c_j s.t. ∀ d_i ∈ Tr : f(d_i, c_j) = 1 then    /* all d_i have class c_j */
        return c_j
    else if isempty(T) then
        return MajorityCateg(Tr)
    else
        t_best ← ChooseFeature(T, Tr)
        tree ← new dtree with root = t_best
        for each v_k ∈ values(t_best) do
            Tr_k ← {d_l ∈ Tr | t_best has value v_k in d_l}
            sbt ← DTreeLearn(Tr_k, T \ {t_best}, MajorityCateg(Tr_k))
            add a branch to tree with label v_k and subtree sbt
        done
        return tree
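A direct Python transcription of Algorithm 1 might look as follows. This is a sketch under my own naming (dtree_learn, majority_categ), with ChooseFeature passed in as a parameter so that the information-gain criterion of the next section can be plugged in:

from collections import Counter

def majority_categ(examples):
    """Most frequent label among (features, label) pairs -- MajorityCateg."""
    return Counter(label for _, label in examples).most_common(1)[0][0]

def dtree_learn(examples, features, default, choose_feature):
    """examples: list of (dict feature -> value, label) pairs; features: set of names."""
    if not examples:                                   # isempty(Tr)
        return default
    labels = {label for _, label in examples}
    if len(labels) == 1:                               # all d_i have the same class
        return labels.pop()
    if not features:                                   # isempty(T)
        return majority_categ(examples)
    t_best = choose_feature(features, examples)        # ChooseFeature(T, Tr)
    tree = {"feature": t_best, "branches": {}}
    for v in {x[t_best] for x, _ in examples}:         # each value of t_best seen in Tr
        subset = [(x, y) for x, y in examples if x[t_best] == v]
        tree["branches"][v] = dtree_learn(subset, features - {t_best},
                                          majority_categ(subset), choose_feature)
    return tree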

Important Issues

Choosing the right feature (from T) to partition the training set:
- choose the feature with the highest information gain.

Avoiding overfitting:
- memorising all observations from Tr vs. extracting patterns and extrapolating to unseen examples in Tv and Te;
- Occam's razor: the most likely hypothesis is the simplest one which is consistent with all observations;
- inductive bias of DT learning: shorter trees are preferred to larger trees.

How do we Implement ChooseFeature?

Finding the right feature with which to partition the training set is essential. One can use the Information Gain (the difference between the entropy of the mother node and the weighted sum of the entropies of the child nodes) yielded by a candidate feature T:

    G(T, D) = H(D) - \sum_{t_i \in values(T)} p(t_i) H(D_{t_i})    (1)

where H(D) is the information entropy of the category distribution on dataset D, that is, for a random variable C with values c_1, ..., c_{|C|} and PMF p(c),

    H(D) = - \sum_{j=1}^{|C|} p(c_j) \log p(c_j)

The sum over the values of T is called the expected entropy (?), and can be written as

    E(T) = - \sum_{t_i \in values(T)} p(t_i) \sum_{j=1}^{|C|} p(c_j | t_i) \log p(c_j | t_i)

Recall that entropy (the H(.) function above), AKA self-information, measures the amount of uncertainty w.r.t. a probability distribution. In other words, entropy is a measure of how much we learn when we observe an event occurring in accordance with this distribution.

Specifically, for Boolean-valued features

For example, the Information Gain yielded by a candidate feature t_k which can take Boolean values (0 or 1), with respect to a binary categorisation task, is given by:

    G(T, D) = H(D) - [ (|D_t| / |D|) H(D_t) + (|D_{\bar t}| / |D|) H(D_{\bar t}) ]    (2)

where D_t and D_{\bar t} are the subsets of D containing the instances for which T has value 1 and 0, respectively, and

    H(D) = - (|D_c| / |D|) \log (|D_c| / |D|) - (|D_{\bar c}| / |D|) \log (|D_{\bar c}| / |D|)    (3)

where |D_c| (|D_{\bar c}|) is the number of positive (negative) instances filed under category c in D.
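Equations (1)-(3) translate almost line by line into code. A minimal sketch (base-2 logarithms assumed; weather_data and dtree_learn refer to the earlier sketches):

import math
from collections import Counter

def entropy(labels):
    """H of the empirical category distribution over 'labels' (base-2 logs)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, examples):
    """G(T, D) = H(D) - sum over values v of p(v) * H(D_v), as in eq. (1)."""
    labels = [y for _, y in examples]
    n = len(examples)
    expected = 0.0
    for v in {x[feature] for x, _ in examples}:
        subset = [y for x, y in examples if x[feature] == v]
        expected += (len(subset) / n) * entropy(subset)
    return entropy(labels) - expected

toy = [({"outlook": "sunny"}, "no"), ({"outlook": "overcast"}, "yes"),
       ({"outlook": "rainy"}, "yes"), ({"outlook": "sunny"}, "no")]
print(information_gain("outlook", toy))   # 1.0: outlook separates these labels perfectly

# On the full weather data sketched earlier, H(D) is about 0.940 bits and
# G(outlook, D) about 0.247, the largest of the four attributes, so a
# gain-based ChooseFeature would place outlook at the root:
# choose = lambda feats, ex: max(feats, key=lambda t: information_gain(t, ex))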

Numeric vector representations

Task: classify REUTERS texts as belonging to category "earnings" (or not).

    [...]<title>Cobanco Inc year net</title>
    <body>shr 3 cts vs 1.19 dlrs
    Net 07,000 vx,5,000
    Assets 510. mln vs 79 mln
    Deposits 7 mln vs 0 mln
    Loans 99. mln vs 37 mln
    Note: th qtr not available.
    Year includes 195 extraordinary gain from tax carry forward of 13,000 dlrs, or five cts per shr</body>...

    T = <vs, mln, cts, ;, &, 000, loss,,, 3, profit, dlrs, 1, pct, is, s, that, net, lt, at>
    d_j = <5, 5, 3, 3, 3,, 0, 0, 0,, 0, 3,, 0, 0, 0, 0, 3,, 0>

Creating the text vectors

The feature set T can be selected (reduced) via one of the information-theoretic functions we have seen: document frequency, G, or χ², for example.

We could assign a weight w_ij to each feature as follows (?):

    w_ij = round( 10 × (1 + \log tf_ij) / (1 + \log l_j) )    (4)

(and w_ij = 0 when tf_ij = 0), where tf_ij is the frequency of term i in document j and l_j is the length of document j.

Features for partitioning Tr can then be selected by discretising the w_ij values (see (?) for a commonly used method) and applying G as shown above.

A DT for the "earnings" category

[Figure: decision tree learnt for the earnings category. Each node shows its number of documents and P(c|n), the probability that a document at node n belongs to category c = "earnings". Internal nodes test discretised feature weights (decision boundaries on cts, net and vs); leaf probabilities range from P(c|n) = 0.05 to P(c|n) = 0.99.]

Calculating node probabilities

One can assign probabilities to a leaf node (i.e. the probability that a new document d belonging to that node should be filed under category c) as follows (using add-one smoothing):

    P(c | d_n) = (|D_cn| + 1) / (|D_cn| + |D_{\bar c n}| + 1 + 1)    (5)
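Assuming the reconstruction of eq. (4) above (length-normalised log term frequency, scaled to 0-10 and rounded) and applying eq. (5) to the raw counts at a leaf, both computations are one-liners; a sketch with invented example values:

import math

def term_weight(tf, doc_len):
    """w_ij of eq. (4): 0 if the term is absent from the document, otherwise
    length-normalised log term frequency scaled to the range 0..10 and rounded."""
    if tf == 0:
        return 0
    return round(10 * (1 + math.log(tf)) / (1 + math.log(doc_len)))

def leaf_probability(n_pos, n_neg):
    """Add-one smoothed P(c | d_n) of eq. (5), from the numbers of positive
    and negative training documents that ended up in node n."""
    return (n_pos + 1) / (n_pos + n_neg + 2)

print(term_weight(5, 89))         # a term occurring 5 times in an 89-token document -> 5
print(leaf_probability(1398, 5))  # a leaf dominated by positive documents -> ~0.995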

where P(c | d_n) is the probability that a document d_n which ended up in node n belongs to category c, and |D_cn| (|D_{\bar c n}|) is the number of (training) documents in node n which have been assigned category c (\bar c).

Pruning

Large trees tend to overfit the data. Pruning (i.e. removal of overspecific nodes) often helps produce better models.

A commonly used approach:
1. build a full decision tree;
2. for each node of height 1:
   (a) test for statistical significance with respect to its leaf nodes,
   (b) remove the node if the expected class distribution (given the parent) is not significantly different from the observed class distribution,
   (c) accept the node otherwise;
3. repeat until all nodes of height 1 have been tested.

The significance test could be, for example,

    χ² = \sum_{k=1}^{n} \sum_{i=1}^{|C|} (O_ki - E_ki)² / E_ki

where O_ki is the number of observed instances of category i in partition k (i.e. those for which the test has value v_k; see Algorithm 1) and E_ki is the number of expected instances, e.g.

    E_ki = n_i × (\sum_{j=1}^{|C|} n_kj) / (\sum_{j=1}^{|C|} n_j)

For instance, the number of expected instances of category c in partition k of a binary categorisation task would be

    E_kc = n_c × (n_kc + n_{k\bar c}) / (n_c + n_{\bar c})

where n_c is the number of instances of category c in the parent node, and n_kc, n_{k\bar c} are the numbers of instances of category c and not category c, respectively, in leaf k.

The Importance of Pruning

Comparing full-size and pruned trees:

[Figure comparing the performance of full-size and pruned trees omitted.]
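A sketch of the significance test for one candidate node (my own illustration, not code from the lecture): the observed category counts O_ki in each partition are compared with the counts E_ki expected from the parent's class distribution, and the node is a pruning candidate when the resulting χ² value falls below the chosen critical value.

def chi_square(partitions):
    """partitions: one dict per branch k of the tested node, mapping each
    category to its observed count O_ki in that branch."""
    categories = {c for part in partitions for c in part}
    parent = {c: sum(part.get(c, 0) for part in partitions) for c in categories}
    n_parent = sum(parent.values())
    stat = 0.0
    for part in partitions:
        n_k = sum(part.values())
        for c in categories:
            expected = parent[c] * n_k / n_parent      # E_ki
            observed = part.get(c, 0)                  # O_ki
            stat += (observed - expected) ** 2 / expected
    return stat

# A split whose branches barely change the parent's class distribution yields a
# very small statistic, suggesting the node can be pruned.
print(chi_square([{"c": 20, "not-c": 18}, {"c": 22, "not-c": 20}]))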

Other pruning techniques include minimum description length (MDL) pruning (?), wrapping (incrementally remove nodes and select the tree which gives peak performance on the validation set by cross-validation), etc. (?).

A final example: WSD

Consider the following occurrences of the word "bank":

RIV  y be? Then he ran down along the bank, toward a narrow, muddy path.
FIN  four bundles of small notes the bank cashier got it into his head
RIV  ross the bridge and on the other bank you only hear the stream, the
RIV  beneath the house, where a steep bank of earth is compacted between
FIN  op but is really the branch of a bank. As I set foot inside, despite
FIN  raffic police also belong to the bank. More foolhardy than entering
FIN  require a number. If you open a bank account, the teller identifies
RIV  circular movement, skirting the bank of the River Jordan, then turn

The WSD learning task is to learn to distinguish between the meanings "financial institution" (FIN) and "the land alongside a river" (RIV).

Task definition

WSD can be described as a categorisation task where:
- the senses (FIN, RIV) are the labels (C);
- the representation of the instances (D) comes from the context surrounding the words to be disambiguated.

E.g., for T = {along, cashier, stream, muddy, ...}, we could have:

    d_1 = <along = 1, cashier = 0, stream = 0, muddy = 1, ...>   and   f(d_1) = RIV

Performance can be measured as in text categorisation (precision, recall, etc.).
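Building such instance representations from the contexts is straightforward; a minimal sketch (the helper name context_features is mine, and the example reuses the first RIV concordance line above):

def context_features(context, vocabulary):
    """Binary bag-of-words representation of a disambiguation instance."""
    words = set(context.lower().split())
    return {term: int(term in words) for term in vocabulary}

T = ["along", "cashier", "stream", "muddy"]
d1 = context_features("Then he ran down along the bank, toward a narrow, muddy path.", T)
print(d1)   # {'along': 1, 'cashier': 0, 'stream': 0, 'muddy': 1}  -> f(d1) = RIV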

A decision tree on RIV

Using the algorithm above (slide 9) we get this decision tree:

[Figure: small decision tree whose internal nodes test the context features river, when, from and money, with leaves labelled RIV and FIN.]

Trained on a small training set, with T = {small, money, on, to, river, from, in, his, accounts, when, by, other, estuary, some, with}.

Other topics

- Cost-sensitive classification; see (?) for a comprehensive survey
- Alternative attribute selection criteria: gain ratio, distance-based measures, etc. (?)
- Regression
- Missing features
- ...