Machine Learning for Language Technology Lecture 8: Decision Trees and k-Nearest Neighbors

1 Machine Learning for Language Technology
Lecture 8: Decision Trees and k-Nearest Neighbors
Marina Santini
Department of Linguistics and Philology
Uppsala University, Uppsala, Sweden
Autumn 2014
Acknowledgement: Thanks to Prof. Joakim Nivre for course design and materials

2 Supervised Classification
Divide instances into (two or more) classes
Instance (feature vector): $x = \langle x_1, \ldots, x_m \rangle$
  Features may be categorical or numerical
Class (label): $y$
Training data: $X = \{x^t, y^t\}_{t=1}^{N}$
Classification in Language Technology
  Spam filtering (spam vs. non-spam)
  Spelling error detection (error vs. no error)
  Text categorization (news, economy, culture, sport, ...)
  Named entity classification (person, location, organization, ...)

3 Models for Classification
Generative probabilistic models:
  Model of $P(x, y)$
  Naive Bayes
Conditional probabilistic models:
  Model of $P(y \mid x)$
  Logistic regression
Discriminative model:
  No explicit probability model
  Decision trees, nearest neighbor classification
  Perceptron, support vector machines, MIRA

4 Repeating
Noise
  Data cleaning is expensive and time consuming
Margin
Inductive bias
Types of inductive biases
  Minimum cross-validation error
  Maximum margin
  Minimum description length
  [...]

5 DECISION TREES

6 Decision Trees
Hierarchical tree structure for classification
  Each internal node specifies a test of some feature
  Each branch corresponds to a value for the tested feature
  Each leaf node provides a classification for the instance
Represents a disjunction of conjunctions of constraints
  Each path from root to leaf specifies a conjunction of tests
  The tree itself represents the disjunction of all paths

7 Decision Tree

8 Divide and Conquer
Internal decision nodes
  Univariate: Uses a single attribute, $x_i$
    Numeric $x_i$: Binary split: $x_i > w_m$
    Discrete $x_i$: n-way split for n possible values
  Multivariate: Uses all attributes, $x$
Leaves
  Classification: class labels (or proportions)
  Regression: $r$ average (or local fit)
Learning:
  Greedy recursive algorithm
  Find best split $X = (X_1, \ldots, X_p)$, then induce tree for each $X_i$

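9 Classification Trees (ID3, CART, C4.5)
For node $m$, $N_m$ instances reach $m$, and $N_m^i$ of them belong to class $C_i$:
\[ \hat{P}(C_i \mid x, m) \equiv p_m^i = \frac{N_m^i}{N_m} \]
Node $m$ is pure if $p_m^i$ is 0 or 1
Measure of impurity is entropy:
\[ I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i \]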

10 Example: Entropy
Assume two classes ($C_1$, $C_2$) and four instances ($x_1$, $x_2$, $x_3$, $x_4$), with the usual convention $0 \log_2 0 = 0$
Case 1: $C_1 = \{x_1, x_2, x_3, x_4\}$, $C_2 = \{\}$
  $I_m = -(1 \log_2 1 + 0 \log_2 0) = 0$
Case 2: $C_1 = \{x_1, x_2, x_3\}$, $C_2 = \{x_4\}$
  $I_m = -(0.75 \log_2 0.75 + 0.25 \log_2 0.25) = 0.81$
Case 3: $C_1 = \{x_1, x_2\}$, $C_2 = \{x_3, x_4\}$
  $I_m = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1$
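The three cases can be checked numerically. The following is a minimal sketch, not part of the original slides; the function name `entropy` and the label strings are illustrative assumptions. It writes each term as $p \log_2 (1/p)$, which is the same quantity as $-p \log_2 p$.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy I_m in bits of a list of class labels."""
    n = len(labels)
    # Only classes that occur are counted, which implements 0 log 0 = 0.
    # p * log2(1/p) equals -p * log2(p).
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

# The three cases from the slide (label names are placeholders).
case1 = ["C1", "C1", "C1", "C1"]
case2 = ["C1", "C1", "C1", "C2"]
case3 = ["C1", "C1", "C2", "C2"]

for name, labels in [("Case 1", case1), ("Case 2", case2), ("Case 3", case3)]:
    print(name, round(entropy(labels), 2))
```

The printed values should match the slide: 0 for a pure node, about 0.81 for a 3:1 split, and 1 for an even split.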

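11 Best Split
If node $m$ is pure, generate a leaf and stop, otherwise split with test $t$ and continue recursively
Find the test that minimizes impurity
Before the split:
\[ \hat{P}(C_i \mid x, m) \equiv p_m^i = \frac{N_m^i}{N_m}, \qquad I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i \]
Impurity after split with test $t$:
  $N_{mj}$ of $N_m$ take branch $j$
  $N_{mj}^i$ belong to $C_i$
\[ \hat{P}(C_i \mid x, m, j) \equiv p_{mj}^i = \frac{N_{mj}^i}{N_{mj}} \]
\[ I_m^t = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i \]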
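The recursive procedure can be sketched in a few lines. This is an illustrative ID3-style sketch for categorical features only, not the lecture's own code: the function names, the dictionary-based tree representation, and the toy spam data are all assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy I_m in bits; absent classes contribute nothing (0 log 0 = 0)."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def split_impurity(rows, labels, feature):
    """Impurity I_m^t after an n-way split on one categorical feature."""
    branches = {}
    for row, label in zip(rows, labels):
        branches.setdefault(row[feature], []).append(label)
    return sum(len(b) / len(labels) * entropy(b) for b in branches.values())

def build_tree(rows, labels, features):
    # Pure node, or no tests left: generate a leaf with the majority class.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the test that minimizes impurity after the split.
    best = min(features, key=lambda f: split_impurity(rows, labels, f))
    remaining = [f for f in features if f != best]
    node = {"feature": best, "branches": {}}
    for value in sorted({row[best] for row in rows}):
        pairs = [(r, y) for r, y in zip(rows, labels) if r[best] == value]
        node["branches"][value] = build_tree(
            [r for r, _ in pairs], [y for _, y in pairs], remaining
        )
    return node

def classify(tree, row):
    # Follow one root-to-leaf path; raises KeyError for a value unseen in training.
    while isinstance(tree, dict):
        tree = tree["branches"][row[tree["feature"]]]
    return tree

# Hypothetical toy data for spam filtering (not from the lecture).
rows = [
    {"has_link": "yes", "all_caps": "no"},
    {"has_link": "yes", "all_caps": "yes"},
    {"has_link": "yes", "all_caps": "no"},
    {"has_link": "no", "all_caps": "yes"},
    {"has_link": "no", "all_caps": "no"},
]
labels = ["spam", "spam", "spam", "spam", "ham"]
tree = build_tree(rows, labels, ["has_link", "all_caps"])
print(tree)
print(classify(tree, {"has_link": "no", "all_caps": "no"}))
```

On this toy data the split on `has_link` leaves less impurity (0.4 bits) than the split on `all_caps` (about 0.55 bits), so it ends up at the root. Numeric features, pruning, and unseen feature values are deliberately left out of the sketch.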

13 Information Gain
We want to determine which attribute in a given set of training feature vectors is most useful for discriminating between the classes to be learned.
Information gain tells us how important a given attribute of the feature vectors is.
We will use it to decide the ordering of attributes in the nodes of a decision tree.

14 Information Gain and Gain Ratio
Choosing the test that minimizes impurity maximizes the information gain (IG):
\[ IG_m^t = I_m - I_m^t \]
\[ I_m = -\sum_{i=1}^{K} p_m^i \log_2 p_m^i \]
\[ I_m^t = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \sum_{i=1}^{K} p_{mj}^i \log_2 p_{mj}^i \]
Information gain prefers features with many values
The normalized version is called gain ratio (GR):
\[ GR_m^t = \frac{IG_m^t}{V_m^t} \]
\[ V_m^t = -\sum_{j=1}^{n} \frac{N_{mj}}{N_m} \log_2 \frac{N_{mj}}{N_m} \]
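The bias of information gain toward many-valued features, and how gain ratio corrects it, can be seen on a small made-up example. This is a sketch under stated assumptions: the function names and the two toy features (`instance_id`, `shape`) are hypothetical and not taken from the slides.

```python
import math
from collections import Counter

def entropy(items):
    """Entropy in bits of the empirical distribution over items."""
    n = len(items)
    return sum((c / n) * math.log2(n / c) for c in Counter(items).values())

def information_gain(values, labels):
    """IG = I_m - I_m^t for an n-way split on one categorical feature."""
    branches = {}
    for v, y in zip(values, labels):
        branches.setdefault(v, []).append(y)
    after = sum(len(b) / len(labels) * entropy(b) for b in branches.values())
    return entropy(labels) - after

def gain_ratio(values, labels):
    """GR = IG / V, where V is the entropy of the branch proportions."""
    split_info = entropy(values)
    if split_info == 0:
        return 0.0  # all instances take the same branch: no real split
    return information_gain(values, labels) / split_info

labels = ["A", "A", "B", "B"]
instance_id = [1, 2, 3, 4]       # hypothetical feature: unique value per instance
shape = ["x", "x", "y", "y"]     # hypothetical feature: two values

print(information_gain(instance_id, labels), gain_ratio(instance_id, labels))
print(information_gain(shape, labels), gain_ratio(shape, labels))
```

Both features reach the maximal information gain of 1 bit, but the four-way split on `instance_id` has $V = 2$, so its gain ratio drops to 0.5, while the two-way split on `shape` keeps a gain ratio of 1.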

15 Pruning Trees
Decision trees are susceptible to overfitting
Remove subtrees for better generalization:
  Prepruning: Early stopping (e.g., with entropy threshold)
  Postpruning: Grow whole tree, then prune subtrees
Prepruning is faster, postpruning is more accurate (requires a separate validation set)

16 Rule Extraction from Trees
C4.5Rules (Quinlan, 1993)

17 Learning Rules
Rule induction is similar to tree induction, but
  tree induction is breadth-first
  rule induction is depth-first (one rule at a time)
Rule learning:
  A rule is a conjunction of terms (cf. tree path)
  A rule covers an example if all terms of the rule evaluate to true for the example (cf. sequence of tests)
  Sequential covering: Generate rules one at a time until all positive examples are covered
  IREP (Fürnkranz and Widmer, 1994), Ripper (Cohen, 1995)

18 Properties of Decision Trees
Decision trees are appropriate for classification when:
  Features can be both categorical and numeric
  Disjunctive descriptions may be required
  Training data may be noisy (missing values, incorrect labels)
  Interpretation of learned model is important (rules)
Inductive bias of (most) decision tree learners:
  1. Prefers trees with informative attributes close to the root
  2. Prefers smaller trees over bigger ones (with pruning)
  3. Preference bias (incomplete search of complete space)

19 K-NEAREST NEIGHBORS

20 Nearest Neighbor Classification
An old idea:
  "This rule of nearest neighbor has considerable elementary intuitive appeal and probably corresponds to practice in many situations. For example, it is possible that much medical diagnosis is influenced by the doctor's recollection of the subsequent history of an earlier patient whose symptoms resemble in some way those of the current patient." (Fix and Hodges, 1952)
Key components:
  Storage of old instances
  Similarity-based reasoning to new instances

21 k-Nearest Neighbour
Learning:
  Store training instances in memory
Classification:
  Given new test instance $x$,
  Compare it to all stored instances
  Compute a distance between $x$ and each stored instance $x^t$
  Keep track of the k closest instances (nearest neighbors)
  Assign to $x$ the majority class of the k nearest neighbours
A geometric view of learning
  Proximity in (feature) space → same class
  The smoothness assumption

22 Eager and Lazy Learning
Eager learning (e.g., decision trees)
  Learning: induce an abstract model from data
  Classification: apply model to new data
Lazy learning (a.k.a. memory-based learning)
  Learning: store data in memory
  Classification: compare new data to data in memory
Properties:
  Retains all the information in the training set: no abstraction
  Complex hypothesis space: suitable for natural language?
  Main drawback: classification can be very inefficient

23 Dimensions of a k-NN Classifier
Distance metric
  How do we measure distance between instances?
  Determines the layout of the instance space
The k parameter
  How large a neighborhood should we consider?
  Determines the complexity of the hypothesis space

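24 Distance Metric 1
Overlap = count of mismatching features
\[ \Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i) \]
\[ \delta(x_i, z_i) = \begin{cases} \left| \dfrac{x_i - z_i}{\max_i - \min_i} \right| & \text{if numeric, else} \\ 0 & \text{if } x_i = z_i \\ 1 & \text{if } x_i \neq z_i \end{cases} \]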
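A minimal sketch of the overlap metric with range-scaled numeric features follows. How numeric features are marked (a `(min, max)` tuple per feature, `None` for categorical ones) is an assumption made for this example, as are the function names and the sample instances.

```python
def delta(x_i, z_i, value_range=None):
    """Per-feature distance: scaled absolute difference if numeric, else 0/1 mismatch."""
    if value_range is not None:
        lo, hi = value_range  # assumed observed minimum and maximum, with hi > lo
        return abs(x_i - z_i) / (hi - lo)
    return 0 if x_i == z_i else 1

def overlap_distance(x, z, ranges):
    """Sum of per-feature distances over all m features."""
    return sum(delta(a, b, r) for a, b, r in zip(x, z, ranges))

# Hypothetical instances: two categorical features and one numeric feature
# whose values are assumed to range from 1.0 to 9.0 in the training data.
x = ("NN", "the", 3.0)
z = ("NN", "a", 5.0)
ranges = (None, None, (1.0, 9.0))

print(overlap_distance(x, z, ranges))  # 0 + 1 + 2/8 = 1.25
```

Scaling by the range keeps each numeric feature's contribution between 0 and 1, so it carries the same maximum weight as a categorical mismatch.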

25 Distance Metric 2
MVDM = Modified Value Difference Metric
\[ \Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i) \]
\[ \delta(x_i, z_i) = \sum_{j=1}^{K} \left| P(C_j \mid x_i) - P(C_j \mid z_i) \right| \]
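To make the MVDM definition concrete, the sketch below estimates $P(C_j \mid x_i)$ for a single feature from relative frequencies in a made-up training sample. The helper names and the data are assumptions for illustration; smoothing and low-frequency values are not handled.

```python
from collections import Counter, defaultdict

def class_distributions(values, labels):
    """Estimate P(C_j | value) for one feature by relative frequency."""
    counts = defaultdict(Counter)
    for v, y in zip(values, labels):
        counts[v][y] += 1
    return {
        v: {c: k / sum(cnt.values()) for c, k in cnt.items()}
        for v, cnt in counts.items()
    }

def mvdm_delta(v1, v2, dist, classes):
    """Sum over classes of |P(C_j | v1) - P(C_j | v2)|."""
    return sum(abs(dist[v1].get(c, 0.0) - dist[v2].get(c, 0.0)) for c in classes)

# Hypothetical training column for one feature, with class labels.
values = ["a", "a", "b", "b", "c", "c"]
labels = ["A", "A", "A", "B", "B", "B"]
dist = class_distributions(values, labels)
classes = sorted(set(labels))

print(mvdm_delta("a", "b", dist, classes))
print(mvdm_delta("a", "c", dist, classes))
```

Under the overlap metric, "a" is equally far (distance 1) from "b" and "c". Under MVDM, "a" is closer to "b" (1.0) than to "c" (2.0), because "a" and "b" co-occur with class A more similarly.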

26 The k parameter
Tunes the complexity of the hypothesis space
  If k = 1, every instance has its own neighborhood
  If k = N, all the feature space is one neighborhood
[Figure panels: k = 1, k = 15]
\[ \hat{E} = E(h \mid V) = \sum_{t=1}^{M} 1\left(h(x^t) \neq r^t\right) \]

27 A Simple Example
Training set:
  1. (a, b, a, c) → A
  2. (a, b, c, a) → B
  3. (b, a, c, c) → C
  4. (c, a, b, c) → A
New instance:
  5. (a, b, b, a)
Overlap metric:
\[ \Delta(x, z) = \sum_{i=1}^{m} \delta(x_i, z_i), \qquad \delta(x_i, z_i) = \begin{cases} \left| \dfrac{x_i - z_i}{\max_i - \min_i} \right| & \text{if numeric, else} \\ 0 & \text{if } x_i = z_i \\ 1 & \text{if } x_i \neq z_i \end{cases} \]
Distances (overlap):
  Δ(1, 5) = 2
  Δ(2, 5) = 1
  Δ(3, 5) = 4
  Δ(4, 5) = 3
k-NN classification:
  1-NN(5) = B
  2-NN(5) = A/B
  3-NN(5) = A
  4-NN(5) = A
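The example can be reproduced with a short script. This is a sketch rather than a reference implementation: the function names are illustrative, and returning the list of tied classes (instead of breaking ties) is a choice made here so that the 2-NN tie stays visible.

```python
from collections import Counter

training = [
    (("a", "b", "a", "c"), "A"),
    (("a", "b", "c", "a"), "B"),
    (("b", "a", "c", "c"), "C"),
    (("c", "a", "b", "c"), "A"),
]
new_instance = ("a", "b", "b", "a")

def overlap(x, z):
    """Count of mismatching (categorical) features."""
    return sum(1 for a, b in zip(x, z) if a != b)

def knn_classify(x, k):
    """Return the majority class(es) among the k nearest stored instances."""
    neighbors = sorted(training, key=lambda inst: overlap(x, inst[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    top = max(votes.values())
    return sorted(label for label, n in votes.items() if n == top)

print([overlap(new_instance, features) for features, _ in training])
for k in (1, 2, 3, 4):
    print(k, knn_classify(new_instance, k))
```

The distances come out as 2, 1, 4, 3, and the predictions as B for k = 1, a tie between A and B for k = 2, and A for k = 3 and k = 4.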

28 Further Variations on k-NN
Feature weights:
  The overlap metric gives all features equal weight
  Features can be weighted by IG or GR
Weighted voting:
  The normal decision rule gives all neighbors equal weight
  Instances can be weighted by (inverse) distance
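Both variations can be sketched on the four-instance training set from the simple example. The inverse-distance weight $1 / (d + \epsilon)$ and the value of `epsilon` are assumptions chosen for this illustration; other weighting schemes exist. The feature weights are plain numbers here, whereas in practice they would be IG or GR values estimated from data.

```python
from collections import defaultdict

training = [
    (("a", "b", "a", "c"), "A"),
    (("a", "b", "c", "a"), "B"),
    (("b", "a", "c", "c"), "C"),
    (("c", "a", "b", "c"), "A"),
]
new_instance = ("a", "b", "b", "a")

def weighted_overlap(x, z, feature_weights):
    """Overlap distance where each mismatch costs that feature's weight."""
    return sum(w for a, b, w in zip(x, z, feature_weights) if a != b)

def weighted_knn(x, k, feature_weights=None, epsilon=1e-6):
    """k-NN where each neighbor votes with weight 1 / (distance + epsilon)."""
    if feature_weights is None:
        feature_weights = [1.0] * len(x)  # plain overlap metric
    scored = sorted(
        ((weighted_overlap(x, inst, feature_weights), label) for inst, label in training),
        key=lambda pair: pair[0],
    )
    votes = defaultdict(float)
    for distance, label in scored[:k]:
        votes[label] += 1.0 / (distance + epsilon)
    return max(votes, key=votes.get)

for k in (1, 2, 3, 4):
    print(k, weighted_knn(new_instance, k))
```

With inverse-distance voting the single closest neighbor (class B, distance 1) outweighs the two class A neighbors at distances 2 and 3, so the prediction is B for every k from 1 to 4, whereas unweighted voting gives A for k = 3 and k = 4.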

29 Properties of k-NN
Nearest neighbor classification is appropriate when:
  Features can be both categorical and numeric
  Disjunctive descriptions may be required
  Training data may be noisy (missing values, incorrect labels)
  Fast classification is not crucial
Inductive bias of k-NN:
  1. Nearby instances should have the same label (smoothness assumption)
  2. All features are equally important (without feature weights)
  3. Complexity tuned by the k parameter

30 End of Lecture 8
