Metodologie per Sistemi Intelligenti
Classification: Rules
Prof. Pier Luca Lanzi
Laurea in Ingegneria Informatica, Politecnico di Milano, Polo regionale di Como
Lecture outline
Why rules?
What are classification rules?
Which types of rules?
Sequential covering algorithms
Why rules?
One of the most expressive and most human-readable representations for hypotheses is a set of IF-THEN rules
The Weather dataset again!

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No
Classification rules for the Weather dataset
Rule 1: outlook = overcast -> class Play [70.7%]
Rule 2: outlook = rain and windy = false -> class Play [63.0%]
Rule 3: outlook = sunny and humidity = high -> class Don't Play [63.0%]
Rule 4: outlook = rain and windy = true -> class Don't Play [50.0%]
Default class: Play
What are classification rules?
They are IF-THEN rules
The IF part states a condition over the data
The THEN part includes a class label
Which types of conditions?
Propositional, with attribute-value comparisons
First-order Horn clauses, with variables
What are the methods?
Method 1: learn a decision tree, then convert it to rules
Method 2: sequential covering algorithms
Sequential covering algorithms
Consider the set E of positive and negative examples
Repeat
  Learn one rule with high accuracy, any coverage
  Remove the positive examples covered by this rule
Until all the positive examples are covered
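A minimal Python sketch of this loop, assuming examples are encoded as (attribute-dict, class-label) pairs and that a learn_one_rule procedure (sketched after the Learn-one-rule slide below) is available:

```python
def covers(rule, x):
    """A rule is a dict mapping attribute -> required value (a conjunction)."""
    return all(x.get(a) == v for a, v in rule.items())

def sequential_covering(examples, target_class, learn_one_rule):
    """examples: list of (attribute_dict, class_label) pairs.
    Assumes each learned rule covers at least one positive example."""
    rules = []
    remaining = list(examples)
    # repeat until every positive example is covered by some rule
    while any(y == target_class for _, y in remaining):
        rule = learn_one_rule(remaining, target_class)
        if rule is None:                 # no acceptable rule can be found
            break
        rules.append(rule)
        # remove the positive examples covered by the new rule
        remaining = [(x, y) for x, y in remaining
                     if not (y == target_class and covers(rule, x))]
    return rules
```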
Exploring the hypothesis space
General to specific: start with the most general hypothesis and proceed through specialization steps
Specific to general: start with the set of the most specific hypotheses and proceed through generalization steps
Learn-one-rule
Example: generating a rule
(figure: instances of two classes, a and b, in the x-y plane)
If true then class = a

Example: generating a rule, II
(figure: the space is split at x = 1.2)
If true then class = a
If x > 1.2 then class = a

Example: generating a rule, III
(figure: a second split is added at y = 2.6)
If true then class = a
If x > 1.2 then class = a
If x > 1.2 and y > 2.6 then class = a

Example: generating a rule, IV
(figure: the final rule covers the region x > 1.2, y > 2.6)
If true then class = a
If x > 1.2 then class = a
If x > 1.2 and y > 2.6 then class = a
Possible rule set for class b:
If x ≤ 1.2 then class = b
If x > 1.2 and y ≤ 2.6 then class = b
More rules could be added for a perfect rule set
Rules vs. trees
Corresponding decision tree (produces exactly the same predictions)
But: rule sets can be clearer when decision trees suffer from replicated subtrees
Also: in multi-class situations, a covering algorithm concentrates on one class at a time, whereas a decision tree learner takes all classes into account
Rules vs. trees
Sequential covering generates a rule by adding tests that maximize the rule's accuracy
Similar to the situation in decision trees: the problem of selecting an attribute to split on
But the decision tree inducer maximizes overall purity
Each new test reduces the rule's coverage
(figure: space of examples, the rule so far, and the rule after adding a new term)
Learn-one-rule
Start from the most general rule, consisting of an empty condition
Add tests on single attributes as long as the performance (the accuracy) improves
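A possible Python rendering of this greedy general-to-specific search, assuming the same example encoding as before and using the p/t ratio (discussed in the test-selection slides below) as the quality measure:

```python
def purity(subset, target_class):
    """p/t: the fraction of covered instances that are in the target class."""
    return (sum(1 for _, y in subset if y == target_class) / len(subset)
            if subset else 0.0)

def learn_one_rule(examples, target_class):
    """Greedy general-to-specific search: start from the empty condition
    (the most general rule) and add one attribute test at a time."""
    rule, covered = {}, list(examples)
    while purity(covered, target_class) < 1.0:
        # candidate tests: attribute = value pairs not already in the rule
        candidates = {(a, v) for x, _ in covered
                      for a, v in x.items() if a not in rule}
        if not candidates:
            break            # the instances cannot be split any further
        a, v = max(candidates,
                   key=lambda av: purity([e for e in covered
                                          if e[0].get(av[0]) == av[1]],
                                         target_class))
        rule[a] = v
        covered = [e for e in covered if e[0].get(a) == v]
    return rule if rule else None
```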
Exploring the hypothesis space
The algorithm that explores the hypothesis space is greedy and may get stuck in local optima
To improve the exploration of the hypothesis space, we can use beam search: at each step, the k best candidate hypotheses are considered
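A sketch of the beam-search variant, reusing covers and purity from the sketches above; the beam width k, the depth bound max_tests, and the scoring by p/t are illustrative assumptions:

```python
def beam_search_rule(examples, target_class, k=5, max_tests=4):
    """Keep the k best partial rules at each specialization step
    instead of committing greedily to a single one."""
    def score(rule):
        return purity([e for e in examples if covers(rule, e[0])],
                      target_class)

    beam, best = [{}], {}                 # start from the empty rule
    for _ in range(max_tests):
        # expand every rule in the beam by one attribute = value test
        expansions = [dict(r, **{a: v}) for r in beam
                      for x, _ in examples
                      for a, v in x.items() if a not in r]
        if not expansions:
            break
        beam = sorted(expansions, key=score, reverse=True)[:k]
        best = max([best] + beam, key=score)
    return best
```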
Another example: the contact lens data
Rule we seek: If ? then recommendation = hard
Possible tests:
Age = Young                              2/8
Age = Pre-presbyopic                     1/8
Age = Presbyopic                         1/8
Spectacle prescription = Myope           3/12
Spectacle prescription = Hypermetrope    1/12
Astigmatism = no                         0/12
Astigmatism = yes                        4/12
Tear production rate = Reduced           0/12
Tear production rate = Normal            4/12
Modified rule and resulting data
Rule with best test added: If astigmatism = yes then recommendation = hard
Instances covered by the modified rule:

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   Yes          Reduced               None
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            Yes          Reduced               None
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   Yes          Reduced               None
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   Yes          Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            Yes          Reduced               None
Presbyopic      Hypermetrope            Yes          Normal                None
Further refinement
Current state: If astigmatism = yes and ? then recommendation = hard
Possible tests:
Age = Young                              2/4
Age = Pre-presbyopic                     1/4
Age = Presbyopic                         1/4
Spectacle prescription = Myope           3/6
Spectacle prescription = Hypermetrope    1/6
Tear production rate = Reduced           0/6
Tear production rate = Normal            4/6
Modified rule and resulting data
Rule with best test added: If astigmatism = yes and tear production rate = normal then recommendation = hard
Instances covered by the modified rule:

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            Yes          Normal                None
Further refinement
Current state: If astigmatism = yes and tear production rate = normal and ? then recommendation = hard
Possible tests:
Age = Young                              2/2
Age = Pre-presbyopic                     1/2
Age = Presbyopic                         1/2
Spectacle prescription = Myope           3/3
Spectacle prescription = Hypermetrope    1/3
Tie between the first and the fourth test: we choose the one with greater coverage
The result
Final rule: If astigmatism = yes and tear production rate = normal and spectacle prescription = myope then recommendation = hard
Second rule for recommending "hard lenses" (built from the instances not covered by the first rule): If age = young and astigmatism = yes and tear production rate = normal then recommendation = hard
These two rules cover all "hard lenses"; the process is then repeated with the other two classes
Selecting a test
Goal: maximize accuracy
t: total number of instances covered by the rule
p: positive examples of the class covered by the rule
t - p: number of errors made by the rule
Simple approach: select the test that maximizes the ratio p/t
We are finished when p/t = 1 or the set of instances can't be split any further
*Test selection criteria
Basic covering algorithm: keep adding conditions to a rule to improve its accuracy; add the condition that improves accuracy the most
Measure 1: p/t
  t: total instances covered by the rule
  p: number of these that are positive
Produces rules that stop covering negative instances as quickly as possible
May produce rules with very small coverage (special cases or noise?)
Measure 2: information gain, p [log(p/t) - log(P/T)]
  P and T: the positive and total counts before the new condition was added
Information gain emphasizes positive rather than negative instances
These measures interact with the pruning mechanism used
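Both measures as small Python helpers (illustrative names; p, t are the positive/total counts after adding the candidate condition, P, T the counts before):

```python
from math import log2

def accuracy_ratio(p, t):
    """Measure 1: the fraction of covered instances that are positive."""
    return p / t if t else 0.0

def information_gain(p, t, P, T):
    """Measure 2: p * (log(p/t) - log(P/T)).
    Weighting by p favors tests that keep many positive instances."""
    if p == 0 or P == 0:
        return 0.0
    return p * (log2(p / t) - log2(P / T))
```

For instance, on the tie in the contact lens example above (Age = Young at 2/2 vs. Spectacle prescription = Myope at 3/3, starting from 4/6), information gain itself prefers the higher-coverage test: 3 * (log2(3/3) - log2(4/6)) ≈ 1.75 against 2 * 0.585 ≈ 1.17.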
*Missing values, numeric attributes
Common treatment of missing values: for any test, they fail
The algorithm must either use other tests to separate out positive instances, or leave them uncovered until later in the process
In some cases it's better to treat "missing" as a separate value
Numeric attributes are treated just as they are in decision trees
*Pruning rules
Two main strategies:
Incremental pruning
Global pruning
Other difference: the pruning criterion
Error on hold-out set (reduced-error pruning)
Statistical significance
MDL principle
Also: post-pruning vs. pre-pruning
The RISE algorithm
It works in a specific-to-general approach
Initially, it creates one rule for each training example
Then it proceeds through elementary generalization steps as long as the overall accuracy does not decrease
The RISE algorithm
Input: ES is the training set
Let RS be ES
Compute Acc(RS)
Repeat
  For each rule R in RS,
    find the nearest example E not covered by R (E is of the same class as R)
    R' = MostSpecificGeneralization(R, E)
    RS' = RS with R replaced by R'
    if (Acc(RS') >= Acc(RS)) then RS = RS'
    if R' is identical to another rule in RS then delete R' from RS
Until no increase in Acc(RS) is obtained
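A compact Python sketch of RISE for nominal attributes; the distance measure, the accuracy estimate (plain resubstitution rather than the leave-one-out scheme of the actual system), and the tie handling are simplifying assumptions:

```python
def matches(conds, x):
    """True if instance x satisfies every attribute = value condition."""
    return all(x.get(a) == v for a, v in conds.items())

def rule_set_accuracy(rules, examples):
    """Classify each instance by the most specific covering rule and
    return the fraction classified correctly."""
    correct = 0
    for x, y in examples:
        covering = [(len(c), cls) for c, cls in rules if matches(c, x)]
        if covering and max(covering)[1] == y:
            correct += 1
    return correct / len(examples)

def nearest_uncovered(rule, examples):
    """Nearest same-class instance not covered by the rule; distance is
    simply the number of conditions the instance violates."""
    conds, cls = rule
    pool = [x for x, y in examples if y == cls and not matches(conds, x)]
    if not pool:
        return None
    return min(pool, key=lambda x: sum(x.get(a) != v
                                       for a, v in conds.items()))

def most_specific_generalization(rule, x):
    """Drop the conditions that x violates (nominal attributes only)."""
    conds, cls = rule
    return ({a: v for a, v in conds.items() if x.get(a) == v}, cls)

def rise(examples):
    """Specific-to-general: start with one maximally specific rule per
    training example, generalize while accuracy does not decrease."""
    rules = [(dict(x), y) for x, y in examples]
    best_acc = rule_set_accuracy(rules, examples)
    improved = True
    while improved:
        improved = False
        for i, rule in enumerate(rules):
            x = nearest_uncovered(rule, examples)
            if x is None:
                continue
            candidate = most_specific_generalization(rule, x)
            trial = rules[:i] + [candidate] + rules[i + 1:]
            acc = rule_set_accuracy(trial, examples)
            if acc >= best_acc:               # ties favor generality
                rules, best_acc = trial, acc
                if rules.count(candidate) > 1:
                    rules.remove(candidate)   # drop the duplicate rule
                improved = True
                break
    return rules
```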
Inferring rudimentary rules
1R: learns a 1-level decision tree, i.e., a set of rules that all test one particular attribute
Basic version:
One branch for each value
Each branch assigns the most frequent class
Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
Choose the attribute with the lowest error rate
(assumes nominal attributes)
Pseudo-code for 1R
For each attribute,
  For each value of the attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute-value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate
Note: "missing" is treated as a separate attribute value
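The pseudo-code maps almost line-by-line onto Python; the dataset encoding as (attribute-dict, label) pairs is an assumption:

```python
from collections import Counter, defaultdict

def one_r(examples):
    """1R: for each attribute, build one rule per attribute value that
    predicts the majority class; keep the attribute whose rules make
    the fewest errors on the training set."""
    best = None                                   # (errors, attr, rules)
    for attr in examples[0][0]:
        counts = defaultdict(Counter)             # value -> class counts
        for x, y in examples:
            counts[x[attr]][y] += 1               # "missing" is a value too
        rules = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        if best is None or errors < best[0]:
            best = (errors, attr, rules)
    return best
```

On the weather data of the next slide this selects outlook (or humidity, which ties at 4/14 errors).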
Evaluating the weather attributes
(the weather dataset of the earlier slide, one attribute at a time)

Attribute  Rules              Errors  Total errors
Outlook    Sunny -> No        2/5     4/14
           Overcast -> Yes    0/4
           Rainy -> Yes       2/5
Temp       Hot -> No*         2/4     5/14
           Mild -> Yes        2/6
           Cool -> Yes        1/4
Humidity   High -> No         3/7     4/14
           Normal -> Yes      1/7
Windy      False -> Yes       2/8     5/14
           True -> No*        3/6
* indicates a tie
Dealing with numeric attributes
Discretize numeric attributes:
Divide each attribute's range into intervals
Sort instances according to the attribute's values
Place breakpoints where the (majority) class changes
This minimizes the total error
Dealing with numeric attributes
Example: temperature from the weather data

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
...

Temperature values sorted, with their classes:
64   65   68   69   70   71   72   72   75   75   80   81   83   85
Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No
The problem of overfitting
This procedure is very sensitive to noise
One instance with an incorrect class label will probably produce a separate interval
Also: ID-like attributes will have zero errors
Simple solution: enforce a minimum number of instances in the majority class per interval
Discretization example
Example (with min = 3):
64   65   68   69   70  |  71   72   72   75   75  |  80   81   83   85
Yes  No   Yes  Yes  Yes |  No   No   Yes  Yes  Yes |  No   Yes  Yes  No
Final result for the temperature attribute:
64   65   68   69   70   71   72   72   75   75  |  80   81   83   85
Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes |  No   Yes  Yes  No
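A sketch of the procedure with the minimum-count fix; the exact interval-closing condition and the merging of adjacent intervals that predict the same class are assumptions reverse-engineered from the example above:

```python
from collections import Counter

def majority(interval):
    """Most frequent class in an interval (ties resolved arbitrarily;
    the lecture marks tie-broken predictions with *)."""
    return Counter(c for _, c in interval).most_common(1)[0][0]

def discretize(points, min_majority=3):
    """points: list of (value, class) pairs. Grow an interval until its
    majority class occurs at least `min_majority` times and the next
    instance has both a different class and a different value; then
    start a new interval. Finally merge adjacent intervals with the
    same majority class and put breakpoints halfway between them."""
    pairs = sorted(points)
    intervals, current = [], []
    for i, (v, y) in enumerate(pairs):
        current.append((v, y))
        maj, count = Counter(c for _, c in current).most_common(1)[0]
        nxt = pairs[i + 1] if i + 1 < len(pairs) else None
        if (nxt is not None and count >= min_majority
                and nxt[1] != maj and nxt[0] != v):
            intervals.append(current)
            current = []
    if current:
        intervals.append(current)
    merged = [intervals[0]]
    for itv in intervals[1:]:
        if majority(itv) == majority(merged[-1]):
            merged[-1] = merged[-1] + itv     # same prediction: merge
        else:
            merged.append(itv)
    # breakpoints halfway between adjacent merged intervals
    return [(a[-1][0] + b[0][0]) / 2 for a, b in zip(merged, merged[1:])]
```

On the temperature values above it returns the single breakpoint 77.5, matching the rule set on the next slide.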
With overfitting avoidance
Resulting rule set:

Attribute    Rules                       Errors  Total errors
Outlook      Sunny -> No                 2/5     4/14
             Overcast -> Yes             0/4
             Rainy -> Yes                2/5
Temperature  ≤ 77.5 -> Yes               3/10    5/14
             > 77.5 -> No*               2/4
Humidity     ≤ 82.5 -> Yes               1/7     3/14
             > 82.5 and ≤ 95.5 -> No     2/6
             > 95.5 -> Yes               0/1
Windy        False -> Yes                2/8     5/14
             True -> No*                 3/6
Discussion of 1R
1R was described in a paper by Holte (1993)
It contains an experimental evaluation on 16 datasets (using cross-validation so that results were representative of performance on future data)
The minimum number of instances was set to 6 after some experimentation
1R's simple rules performed not much worse than much more complex decision trees
Simplicity first pays off!
The Monks 1 dataset

A1  A2  A3  A4  A5  A6  CLASS
1   1   1   1   3   1   1
1   1   1   1   3   2   1
1   1   1   3   2   1   1
1   1   1   3   3   2   1
1   1   2   3   1   2   1
1   2   1   1   1   2   1
1   2   1   1   2   1   0
1   2   1   1   3   1   0
1   2   1   1   4   2   0
1   2   1   2   1   1   1
1   2   1   2   3   1   0
1   2   1   2   3   2   0
Monks 1: decision tree
attribute#5 = 1: 1 (29.0/1.4)
attribute#5 = 2: 0 (31.0/13.4)
attribute#5 = 3:
|   attribute#6 = 1: 0 (13.0/4.7)
|   attribute#6 = 2:
|   |   attribute#3 = 1: 1 (7.0/3.4)
|   |   attribute#3 = 2: 0 (10.0/4.6)
attribute#5 = 4:
|   attribute#1 = 1: 0 (14.0/2.5)
|   attribute#1 = 2:
|   |   attribute#2 = 1: 0 (6.0/1.2)
|   |   attribute#2 = 2: 1 (4.0/1.2)
|   |   attribute#2 = 3: 0 (1.0/0.8)
|   attribute#1 = 3:
|   |   attribute#2 = 1: 1 (0.0)
|   |   attribute#2 = 2: 0 (3.0/1.1)
|   |   attribute#2 = 3: 1 (6.0/1.2)
Monks 1: classification rules
Rule 1: attribute#5 = 1 -> class 1 [95.3%]
Rule 20: attribute#1 = 3 and attribute#2 = 3 -> class 1 [92.2%]
Rule 17: attribute#1 = 2 and attribute#2 = 2 -> class 1 [91.2%]
Rule 7: attribute#1 = 1 and attribute#2 = 1 -> class 1 [85.7%]
Rule 14: attribute#1 = 1 and attribute#5 = 4 -> class 0 [82.2%]
Rule 16: attribute#1 = 2 and attribute#2 = 1 and attribute#5 = 4 -> class 0 [79.4%]
Default class: 0
Monks 1: another solution
(A1 = A2) OR (A5 = 1)
Decision trees and classification rules do not include variables, but only propositions
What are Horn clauses?
A Horn clause is an expression of the form:
H ← L1 ∧ L2 ∧ L3 ∧ ... ∧ Ln
where H, L1, ..., Ln are positive literals (predicates applied to a set of terms)
H is called the head or consequent
L1 ∧ L2 ∧ L3 ∧ ... ∧ Ln is called the body or antecedent
Example: daughter(x,y) ← father(y,x) ∧ female(x)
Learning first-order rules: FOIL
Extends the typical sequential covering algorithms to the learning of first-order rules, or Horn clauses
The best-known algorithm is FOIL, which employs an approach very similar to the sequential-covering and learn-one-rule algorithms
FOIL rules are more restricted than general Horn clauses (literals cannot contain function symbols)
FOIL rules are more expressive than Horn clauses because the literals appearing in the body may be negated
The FOIL algorithm
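FOIL greedily adds to the current clause the literal with the highest information-based gain; a sketch of that criterion (the common simplified form of FOIL gain, with illustrative parameter names):

```python
from math import log2

def foil_gain(p0, n0, p1, n1):
    """FOIL gain for adding a literal to a clause.
    p0, n0: positive/negative bindings covered before the literal;
    p1, n1: positive/negative bindings covered after.
    The gain weights the information improvement by the number of
    positive bindings still covered (approximated here by p1, a
    common simplification)."""
    if p1 == 0:
        return 0.0
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))
```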
Monks 1: FOIL
Clause 0: is_0(a,b,c,d,e,f) :- A<>B, E<>A1.
(the clause states that an instance belongs to class 0 when its first two attributes differ and its fifth attribute is not 1, the complement of (A1 = A2) OR (A5 = 1))
Other approaches
Sequential-covering algorithms are only one of the possible approaches
Classification rules can also be derived from other representations, such as:
Decision trees
Association rules
Neural networks
Alternatively, classification rules can be derived through other search approaches, such as genetic algorithms and genetic programming
Summary
Classification rules are used because they are more expressive and more human-readable
Most of the algorithms are based on sequential covering, which can be used to derive both propositional and first-order rules
Other approaches exist:
Specific-to-general exploration (RISE)
Post-processing of neural networks, association rules, decision trees, etc.
References
Robert C. Holte, Very Simple Classification Rules Perform Well on Most Commonly Used Datasets. Computer Science Department, University of Ottawa, 1993.
Pedro Domingos, Two-Way Induction. Proceedings of the Seventh IEEE International Conference on Tools with Artificial Intelligence (pp. 182-189), 1995. Herndon, VA: IEEE Computer Society Press.
Pedro Domingos, Rule Induction and Instance-Based Learning: A Unified Approach. Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (pp. 1226-1232), 1995. Montreal, Canada: Morgan Kaufmann.
Pedro Domingos, The RISE System: Conquering Without Separating. Proceedings of the Sixth IEEE International Conference on Tools with Artificial Intelligence (pp. 704-707), 1994. New Orleans, LA: IEEE Computer Society Press.