Simple Classifiers
Jialiang Bao, Joseph Boyd, James Forkey, Shengwen Han, Trevor Hodde, Yumou Wang
Overview
Section 3.1: Simplicity First
Always start simple! Accuracy can be misleading.
Section 3.1: Simplicity First
Imagine two datasets.
Section 3.1: Simplicity First
Now apply your favorite classifier: it scores 60% on one dataset and 80% on the other. Which result is better?
Section 3.1: Simplicity First
Compare to a simple classifier (ZeroR or OneR): the baseline scores 10% on the first dataset and 90% on the second. Which result is better? It depends on the dataset.
Section 3.1: Simplicity First
On the first dataset, 60% accuracy is a big improvement over the simple classifier's accuracy of 10%.
Section 3.1: Simplicity First
On the second dataset, 80% accuracy for the complex classifier seems good, but it's worse than the simple classifier's 90%.
Section 3.1: Simplicity First
Two simple classifiers:
ZeroR: always choose the most common value of the target class.
OneR: one attribute does all the work.
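To make the two baselines concrete, here is a minimal Python sketch of ZeroR and of OneR for nominal attributes. This is an illustration of the idea, not Weka's implementation; the data fragment and function names are made up.

```python
from collections import Counter, defaultdict

def zero_r(labels):
    """ZeroR: always predict the most common value of the target class."""
    return Counter(labels).most_common(1)[0][0]

def one_r(rows, labels, attributes):
    """OneR: for each attribute, map every attribute value to the majority class
    seen with it, and keep the attribute whose rule makes the fewest training
    errors. `rows` is a list of dicts keyed by attribute name."""
    best = None
    for attr in attributes:
        per_value = defaultdict(Counter)
        for row, label in zip(rows, labels):
            per_value[row[attr]][label] += 1
        rule = {v: counts.most_common(1)[0][0] for v, counts in per_value.items()}
        errors = sum(rule[row[attr]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best  # (chosen attribute, value -> class rule, training errors)

# Toy usage on a made-up fragment of the weather data
rows = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "overcast"},
        {"outlook": "rainy"}, {"outlook": "rainy"}]
labels = ["no", "no", "yes", "yes", "no"]
print(zero_r(labels))                     # 'no' (the majority class, 3 of 5)
print(one_r(rows, labels, ["outlook"]))   # a one-attribute rule on outlook
```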
Section 3.2: Overfitting is a general problem
Any ML method may overfit the training data. Will it work well on independent test data?
Section 3.2: Example 1: weather.numeric with OneR
Making a rule on the temperature attribute is quite complex. OneR has a parameter (minBucketSize) to limit the complexity.
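The effect of that parameter can be sketched as follows. This is a simplified stand-in for OneR's discretisation of numeric attributes, not Weka's code, and the temperatures and labels below are made up in the spirit of weather.numeric rather than the actual file.

```python
from collections import Counter

def one_r_numeric(values, labels, min_bucket_size):
    """Simplified sketch of OneR on a numeric attribute: sort by value, grow a
    bucket until it contains at least `min_bucket_size` instances of its majority
    class, then start a new bucket. Each bucket predicts its majority class."""
    pairs = sorted(zip(values, labels))
    buckets, current = [], []
    for value, cls in pairs:
        current.append((value, cls))
        majority, count = Counter(c for _, c in current).most_common(1)[0]
        if count >= min_bucket_size:
            buckets.append((value, majority))   # (upper boundary, predicted class)
            current = []
    if current:                                 # leftover instances form a final bucket
        majority = Counter(c for _, c in current).most_common(1)[0][0]
        buckets.append((current[-1][0], majority))
    return buckets

temps = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]          # made-up values
play  = ["yes", "no", "yes", "yes", "yes", "no",
         "no", "yes", "no", "yes", "yes", "no"]

print(len(one_r_numeric(temps, play, 1)))   # 12 buckets: one per instance, badly overfit
print(len(one_r_numeric(temps, play, 6)))   # 2 buckets: a much simpler rule
```

A tiny bucket size lets the rule memorize the training data; a larger one forces a simpler, more general rule.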
Section 3.2: Remove the outlook attribute.
Section 3.2: Example 2: diabetes
The ZeroR baseline scores 65%.
Section 3.2: minBucketSize = 1
With minBucketSize = 1, OneR builds a highly specific rule on the pedi attribute.
Section 3.2: Evaluating on the training set is misleading
To choose the best ML method, use training, validation, and test sets together.
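A hedged sketch of that workflow: hold out a validation set to pick among candidate methods, and touch the test set only once at the end. The `fit`/`predict` interface and the `candidates` dictionary are assumptions for illustration, not a Weka API.

```python
import random

def accuracy(model, X, y):
    return sum(p == t for p, t in zip(model.predict(X), y)) / len(y)

def choose_and_evaluate(candidates, X, y, seed=0):
    """Split the data into training / validation / test parts, train every
    candidate on the training part, pick the one that scores best on the
    validation part, and report its accuracy on the untouched test part."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    n = len(idx)
    train_idx = idx[: n // 2]
    val_idx = idx[n // 2 : 3 * n // 4]
    test_idx = idx[3 * n // 4 :]
    take = lambda ids: ([X[i] for i in ids], [y[i] for i in ids])
    (Xtr, ytr), (Xva, yva), (Xte, yte) = take(train_idx), take(val_idx), take(test_idx)

    best_name, best_model, best_val = None, None, -1.0
    for name, make_model in candidates.items():
        model = make_model().fit(Xtr, ytr)      # fit() is assumed to return the model
        score = accuracy(model, Xva, yva)
        if score > best_val:
            best_name, best_model, best_val = name, model, score

    # The test set is used exactly once, so the final figure is not biased by the selection.
    return best_name, best_val, accuracy(best_model, Xte, yte)
```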
Section 3.3: Naïve Bayes
(OneR: one attribute does all the work.) The opposite strategy: use all the attributes.
The Naïve Bayes method makes two assumptions: attributes are equally important a priori, and statistically independent, i.e. knowing the value of one attribute says nothing about the value of another.
The independence assumption is never correct, but the method often works well in practice.
Section 3.3: The Bayes method
Named after Thomas Bayes, British mathematician (1702-1761).
Pr[H | E] is the probability of event H given evidence E.
Pr[H] is the a priori probability of H: the probability of the event before the evidence is seen.
Pr[H | E] is the a posteriori probability of H: the probability of the event after the evidence is seen.
Section 3.3: The naïve assumption
The evidence splits into parts that are independent given the class.
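Written out, these are the standard formulas behind the two slides above, added here for reference.

```latex
% Bayes' rule: posterior probability of hypothesis H given evidence E
\Pr[H \mid E] = \frac{\Pr[E \mid H]\,\Pr[H]}{\Pr[E]}

% Naive assumption: the evidence splits into parts E_1,\dots,E_n that are
% independent given H, so the likelihood factorises
\Pr[H \mid E] \propto \Pr[E_1 \mid H]\,\Pr[E_2 \mid H]\cdots\Pr[E_n \mid H]\,\Pr[H]
```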
Section 3.3: Avoid zero frequencies
Start all counts at 1 (the Laplace correction).
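A minimal sketch of count-based Naïve Bayes with this correction, assuming nominal attributes. The attribute names and data fragment are made up, and this is not Weka's NaiveBayes implementation.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count attribute values per class. `rows` is a list of dicts keyed by
    attribute name; `labels` holds the class of each row."""
    classes = sorted(set(labels))
    class_counts = Counter(labels)
    counts = defaultdict(lambda: defaultdict(Counter))   # counts[attr][class][value]
    values = defaultdict(set)                            # possible values per attribute
    for row, c in zip(rows, labels):
        for attr, v in row.items():
            counts[attr][c][v] += 1
            values[attr].add(v)
    return classes, class_counts, counts, values, len(labels)

def predict(model, row):
    classes, class_counts, counts, values, n = model
    scores = {}
    for c in classes:
        score = class_counts[c] / n                      # prior Pr[H]
        for attr, v in row.items():
            num = counts[attr][c][v] + 1                 # Laplace: start every count at 1
            den = class_counts[c] + len(values[attr])
            score *= num / den                           # Pr[E_i | H]
        scores[c] = score
    return max(scores, key=scores.get), scores

# Toy usage on a made-up fragment of the weather data
rows = [{"outlook": "sunny", "windy": "false"}, {"outlook": "sunny", "windy": "true"},
        {"outlook": "overcast", "windy": "false"}, {"outlook": "rainy", "windy": "false"},
        {"outlook": "rainy", "windy": "true"}]
labels = ["no", "no", "yes", "yes", "no"]
model = train_naive_bayes(rows, labels)
print(predict(model, {"outlook": "overcast", "windy": "true"}))   # winner plus unnormalised scores
```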
Section 3.3: Naïve Bayes
All attributes contribute equally and independently. It works surprisingly well. Why? Classification doesn't need accurate probability estimates, as long as the greatest probability is assigned to the correct class.
Adding redundant attributes causes problems (e.g. identical attributes), so use attribute selection.
Section 3.4: Decision trees (J48)
Top-down recursive divide and conquer: select an attribute for the root node, split the instances into subsets, and repeat recursively for each branch.
The problem? Finding the best attribute to split on at each stage.
How to select the best attribute: the quest for purity
Find the attribute whose split produces the purest subsets, that is, the node we gain the most information from. Information theory measures information gain in bits:
entropy(p1, p2, ..., pn) = -p1 log p1 - p2 log p2 - ... - pn log pn
Information gain is the amount of information gained by knowing the value of the attribute:
(entropy of the distribution before the split) - (entropy of the distribution after it)
Example
We want: entropy(play) - entropy(play | outlook).
P(play = yes) = 9/14, P(play = no) = 5/14.
Entropy(play) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.940
Find the entropy for outlook = sunny, overcast, and rainy:
Entropy(outlook = sunny) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.971
Entropy(outlook = overcast) = -4/4 log2(4/4) - 0 log2(0) = 0
Entropy(outlook = rainy) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.971
Example (cont'd)
We want: entropy(play) - entropy(play | outlook).
Entropy(play) = 0.940; Entropy(outlook = sunny) = 0.971; Entropy(outlook = overcast) = 0; Entropy(outlook = rainy) = 0.971.
P(sunny) = 5/14, P(overcast) = 4/14, P(rainy) = 5/14.
Entropy(play | outlook) = 5/14 (0.971) + 4/14 (0) + 5/14 (0.971) = 0.694
Information gain: entropy(play) - entropy(play | outlook) = 0.940 - 0.694 = 0.246
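The arithmetic above can be checked with a few lines of Python; nothing here is Weka, it just redoes the slide's calculation.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

h_play = entropy([9, 5])                                   # play: 9 yes, 5 no  -> ~0.940

# outlook splits the 14 instances into sunny (2 yes, 3 no),
# overcast (4 yes, 0 no) and rainy (3 yes, 2 no)
splits = [[2, 3], [4, 0], [3, 2]]
h_after = sum(sum(s) / 14 * entropy(s) for s in splits)    # ~0.694

print(round(h_play, 3), round(h_after, 3), round(h_play - h_after, 3))   # 0.94 0.694 0.247
```

Computed at full precision the gain comes out as 0.247; the slide's 0.246 comes from subtracting the already-rounded values.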
Reality
Cool. How about a real-world decision tree? They're actually pretty intuitive for humans, and we don't have to worry about choosing attributes: J48 does it for us!
But such trees are vulnerable to overfitting.
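For reference, here is a minimal ID3-style sketch of the top-down divide-and-conquer procedure described above, for nominal attributes only. It illustrates the idea rather than J48 itself: J48 (C4.5) additionally uses gain ratio, handles numeric attributes and missing values, and prunes the tree.

```python
from collections import Counter
from math import log2

def entropy_of(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy before the split minus the weighted entropy after splitting on attr."""
    groups = {}
    for row, c in zip(rows, labels):
        groups.setdefault(row[attr], []).append(c)
    after = sum(len(g) / len(labels) * entropy_of(g) for g in groups.values())
    return entropy_of(labels) - after

def build_tree(rows, labels, attributes):
    """Top-down divide and conquer: pick the highest-gain attribute, split,
    and recurse on each branch until a branch is pure or attributes run out."""
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]        # leaf: majority class
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for row, c in zip(rows, labels):
        branches.setdefault(row[best], ([], []))
        branches[row[best]][0].append(row)
        branches[row[best]][1].append(c)
    return {best: {value: build_tree(sub_rows, sub_labels, remaining)
                   for value, (sub_rows, sub_labels) in branches.items()}}
```

Run on the nominal weather data, this puts outlook at the root, matching the information-gain calculation above.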
Section 3.5: Pruning
Decision tree pruning is a technique used in machine learning that reduces the size of a decision tree by removing sections that provide little classification power. The goal is to reduce the complexity of the classifier and improve accuracy by reducing overfitting.
Section 3.5: Pruning
Pruning happens automatically in most classifiers after the tree is constructed.
Example: the breast-cancer data set with the J48 classifier. Click on the highlighted box to see a list of options to run the classifier with.
Section 3.5: Pruning
Select the "unpruned" option.
Section 3.5: Pruning
The default, pruned J48 classifier is about 75% accurate.
Section 3.5: Pruning
Accuracy drops to about 69% when unpruned!
Section 3.5: Pruning
How do classifiers do their pruning?
Typically, a classifier will stop splitting nodes once they get very small.
Classifiers then build the full tree and work in from the leaves: a statistical test is used to determine which leaves to prune.
Interior nodes can also be pruned, moving lower levels of the tree upward (subtree raising; this is the default behavior).
Section 3.5: Pruning
Why pruning? Over-fitting! Sometimes a decision tree is too complex: the classifier works well on the training data but does not generalize to independent test data.
Simplifying the tree can improve accuracy as well as performance.
Pruning is not restricted to trees; it can be applied to other data structures and methods to improve performance.
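As a concrete illustration of post-pruning, here is a sketch of reduced-error pruning applied to the dict-based trees from the earlier build_tree sketch. This is a deliberately simple scheme chosen for clarity; J48's default pruning instead uses a statistical (error-based) estimate from the training data together with subtree raising.

```python
from collections import Counter

def classify(tree, row, default):
    """Follow a dict-based tree (as built by the earlier build_tree sketch)."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr].get(row[attr], default)
    return tree

def errors(tree, rows, labels, default):
    return sum(classify(tree, r, default) != c for r, c in zip(rows, labels))

def prune(tree, rows, labels, default):
    """Reduced-error pruning: work bottom-up and replace a subtree by a single
    leaf whenever that does not increase the error on a held-out pruning set.
    `rows`/`labels` are the pruning-set instances that reach this node."""
    if not isinstance(tree, dict) or not labels:
        return tree
    attr = next(iter(tree))
    for value in list(tree[attr]):
        reach = [(r, c) for r, c in zip(rows, labels) if r[attr] == value]
        tree[attr][value] = prune(tree[attr][value],
                                  [r for r, _ in reach], [c for _, c in reach], default)
    leaf = Counter(labels).most_common(1)[0][0]       # majority class at this node
    if errors(leaf, rows, labels, default) <= errors(tree, rows, labels, default):
        return leaf                                   # collapsing the subtree is at least as good
    return tree
```

In this sketch the replacement leaf takes the majority class of the pruning-set instances at the node; classic reduced-error pruning uses the training-set majority, but the overall behavior is the same for illustration.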
Section 3.6: Nearest neighbor
Rote learning: the simplest form of learning. To classify a new instance, search the training set for the one that is most like it. The instances themselves are the knowledge.
Lazy learning: do nothing until you have to make predictions. No decision tree is built.
Section 3.6: Same class
The new instance is assigned the class of its most similar training instance.
Section 3.6: Search the training set for the instance that is most like the new one
Thus, we need a similarity function.
Training data:
young: 20, 22, 28, 30
middle-age: 36, 38, 65, 56
elderly: 75, 78, 95, 100
Averages: 25, 45, 85
Test data: 19, 40, 83
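A minimal nearest-neighbor sketch of this example in plain Python, using absolute difference on the single numeric attribute as the similarity function. With k greater than 1 it also shows one way to soften the effect of the noisy instances discussed on the following slide. This is an illustration, not Weka's IBk implementation.

```python
from collections import Counter

def knn_predict(train, query, k=1):
    """Instance-based ('rote') learning: no model is built at training time.
    To classify, find the k stored instances closest to the query and vote.
    k > 1 makes the prediction more robust to noisy (mislabelled) instances."""
    neighbours = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Training data from the slide: (age, class)
train = [(20, "young"), (22, "young"), (28, "young"), (30, "young"),
         (36, "middle-age"), (38, "middle-age"), (56, "middle-age"), (65, "middle-age"),
         (75, "elderly"), (78, "elderly"), (95, "elderly"), (100, "elderly")]

for age in (19, 40, 83):                 # test instances from the slide
    print(age, knn_predict(train, age, k=1))   # 19 -> young, 40 -> middle-age, 83 -> elderly
```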
Section 3.6: What are noisy instances?
They are incorrectly labelled instances in the training set.
Section 3.6: Advantages and disadvantages
Advantage: accurate, because predictions are based directly on the current training data.
Disadvantage: slow, because the search over the training set is repeated for every prediction.
Thank you! Any questions?