Jialiang Bao, Joseph Boyd, James Forkey, Shengwen Han, Trevor Hodde, Yumou Wang 10/01/2013

Simple Classifiers. Jialiang Bao, Joseph Boyd, James Forkey, Shengwen Han, Trevor Hodde, Yumou Wang

Overview: simplicity first, overfitting, Naïve Bayes, decision trees with J48, pruning, and nearest-neighbor (instance-based) learning.

Section 3.1: Simplicity First. Always start simple! Accuracy can be misleading.

Section 3.1: Simplicity First. Imagine two datasets.

Section 3.1: Simplicity First. Now apply your favorite classifier: 60% on one dataset, 80% on the other. Which is better?

Section 3.1: Simplicity First. Compare to a simple classifier (ZeroR or OneR): 60% vs. 10% on the first dataset, 80% vs. 90% on the second. It depends on the dataset.

Section 3.1: Simplicity First. 60% accuracy is a big improvement over the simple classifier's accuracy of 10%.

Section 3.1: Simplicity First. 80% accuracy for the complex classifier seems good, but it's worse than the simple classifier's 90%.

Section 3.1: Simplicity First. Two simple classifiers: ZeroR always chooses the most common value of the target class; OneR lets one attribute do all the work.
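To make the two rules concrete, here is a minimal Python sketch (an illustration under simplifying assumptions, not Weka's ZeroR/OneR implementations; the data below is the standard nominal weather set, with play as the class):

from collections import Counter, defaultdict

# Standard nominal weather data: (outlook, play)
data = [("sunny", "no"), ("sunny", "no"), ("overcast", "yes"), ("rainy", "yes"),
        ("rainy", "yes"), ("rainy", "no"), ("overcast", "yes"), ("sunny", "no"),
        ("sunny", "yes"), ("rainy", "yes"), ("sunny", "yes"), ("overcast", "yes"),
        ("overcast", "yes"), ("rainy", "no")]

def zero_r(labels):
    # ZeroR: always predict the most common class, ignoring every attribute.
    return Counter(labels).most_common(1)[0][0]

def one_r(values, labels):
    # OneR rule for a single nominal attribute: map each value to its majority class.
    by_value = defaultdict(Counter)
    for v, y in zip(values, labels):
        by_value[v][y] += 1
    return {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}

outlook = [d[0] for d in data]
play = [d[1] for d in data]
print(zero_r(play))          # 'yes' (9 yes vs. 5 no)
print(one_r(outlook, play))  # {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'}

Full OneR builds one such rule per attribute and keeps the one with the fewest training errors; for numeric attributes Weka's OneR also discretizes values into buckets, and the minBucketSize parameter mentioned on the following slides limits how complex that rule can get.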

Section 3.2: Overfitting is a general problem. Any ML method may overfit the training data; will it still work well on independent test data?

Section 3.2: Example 1: weather.numeric with OneR. Making a rule on temperature is quite complex; OneR has a parameter to limit the complexity.

Section 3.2: Remove outlook.


Section 3.2: Example 2: diabetes. ZeroR: 65% accuracy.

Section 3.2: With minBucketSize = 1, OneR builds its rule on the pedi attribute.


Section 3.2: Evaluating on the training set is misleading. How do we choose the best ML method? Use training, test, and validation sets together.

Section 3.3: Naïve Bayes. OneR lets one attribute do all the work; the opposite strategy is to use all the attributes. The Naïve Bayes method makes two assumptions: attributes are equally important a priori, and statistically independent (knowing the value of one attribute says nothing about the value of another). The independence assumption is never correct, but the method often works well in practice.

Section 3.3: Bayes's rule (Thomas Bayes, British mathematician, 1702-1761): the probability of event H given evidence E is Pr[H | E] = Pr[E | H] Pr[H] / Pr[E]. Pr[H] is the a priori probability of H, the probability of the event before the evidence is seen; Pr[H | E] is the a posteriori probability of H, the probability of the event after the evidence is seen.

Section 3.3: Naïve assumption: the evidence splits into parts E1, ..., En that are independent given the class, so Pr[H | E] = Pr[E1 | H] Pr[E2 | H] ... Pr[En | H] Pr[H] / Pr[E].


Section 3.3: Avoid zero frequencies: start all counts at 1.

Section 3.3: Naïve Bayes: all attributes contribute equally and independently. It works surprisingly well. Why? Classification doesn't need accurate probability estimates as long as the greatest probability is assigned to the correct class. Adding redundant attributes causes problems (e.g. identical attributes), which motivates attribute selection.
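A rough Python sketch of this counting scheme, using the correction from the previous slide (every count effectively starts at 1). This is an illustration only, not Weka's NaiveBayes; rows are assumed to be tuples of nominal attribute values.

from collections import defaultdict

def train_nb(rows, labels):
    # rows: list of attribute-value tuples; labels: class of each row.
    n_attrs = len(rows[0])
    class_count = defaultdict(int)
    cond = [defaultdict(int) for _ in range(n_attrs)]   # (value, class) -> count
    values = [set() for _ in range(n_attrs)]            # distinct values per attribute
    for row, c in zip(rows, labels):
        class_count[c] += 1
        for i, v in enumerate(row):
            cond[i][(v, c)] += 1
            values[i].add(v)
    return class_count, cond, values, len(rows)

def predict_nb(model, row):
    class_count, cond, values, n = model
    best, best_score = None, -1.0
    for c in class_count:
        score = class_count[c] / n                       # Pr[H]
        for i, v in enumerate(row):
            # start every count at 1 (Laplace correction) to avoid zero frequencies
            score *= (cond[i][(v, c)] + 1) / (class_count[c] + len(values[i]))
        if score > best_score:
            best, best_score = c, score
    return best                                          # class with the greatest probability

Trained on the weather data with rows of (outlook, temperature, humidity, windy) tuples, predict_nb simply returns whichever class gets the greater product of counts-based estimates.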

J48: top-down recursive divide and conquer. Select an attribute for the root node, split the instances into subsets, and repeat recursively for each branch. The problem? Finding the best attribute to split on at each stage.
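A hedged Python skeleton of that loop (this is not J48 itself, which also handles numeric attributes, missing values, and pruning; rows are assumed to be dicts keyed by attribute name, and best_attribute is an assumed helper that maximizes information gain, sketched after the worked example below):

from collections import Counter

def build_tree(rows, labels, attributes):
    # Top-down recursive divide and conquer.
    if len(set(labels)) == 1 or not attributes:          # pure node or nothing left to split on
        return Counter(labels).most_common(1)[0][0]      # leaf: majority class
    a = best_attribute(rows, labels, attributes)         # 1. select attribute for this node
    tree = {"attribute": a, "branches": {}}
    for value in set(row[a] for row in rows):            # 2. split instances into subsets
        keep = [(r, y) for r, y in zip(rows, labels) if r[a] == value]
        sub_rows = [r for r, _ in keep]
        sub_labels = [y for _, y in keep]
        rest = [x for x in attributes if x != a]
        tree["branches"][value] = build_tree(sub_rows, sub_labels, rest)  # 3. recurse per branch
    return tree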

How to select the best attribute? The quest for purity: find the attribute with the least fluctuation in class values, that is, the split we gain the most information from. Information theory measures information gain in bits: entropy(p1, p2, ..., pn) = -p1 log p1 - p2 log p2 - ... - pn log pn. Information gain is the amount of information gained by knowing the value of the attribute: (entropy of the distribution before the split) - (entropy of the distribution after it).

Example. We want entropy(play) - entropy(play | outlook). P(play = yes) = 9/14, P(play = no) = 5/14, so Entropy(play) = -9/14 log2(9/14) - 5/14 log2(5/14) = 0.940. Find the entropy of play for outlook = sunny, overcast, and rainy: Entropy(play | outlook = sunny) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.971; Entropy(play | outlook = overcast) = -4/4 log2(4/4) - 0 log2(0) = 0; Entropy(play | outlook = rainy) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.971.

Example (cont'd). We want entropy(play) - entropy(play | outlook). Entropy(play) = 0.940, Entropy(play | outlook = sunny) = 0.971, Entropy(play | outlook = overcast) = 0, Entropy(play | outlook = rainy) = 0.971. To find entropy(play | outlook), weight by P(sunny) = 5/14, P(overcast) = 4/14, P(rainy) = 5/14: Entropy(play | outlook) = 5/14 (0.971) + 4/14 (0) + 5/14 (0.971) = 0.694. Information gain = entropy(play) - entropy(play | outlook) = 0.940 - 0.694 = 0.246.
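The same numbers fall out of a few lines of Python. This sketch is illustrative only (it is not Weka's code); the play and outlook lists are the standard 14-instance weather data, and best_attribute is the helper assumed by the divide-and-conquer skeleton above.

from collections import Counter
from math import log2

def entropy(labels):
    # entropy(p1, ..., pn) = -p1 log2 p1 - ... - pn log2 pn over the class distribution
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    # entropy of the distribution before the split minus the weighted entropy after it
    n = len(labels)
    after = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        after += len(subset) / n * entropy(subset)
    return entropy(labels) - after

def best_attribute(rows, labels, attributes):
    # pick the attribute with the highest information gain (rows are dicts by attribute name)
    return max(attributes, key=lambda a: info_gain([r[a] for r in rows], labels))

play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]
print(round(entropy(play), 3))             # 0.940
print(round(info_gain(outlook, play), 3))  # 0.247 (the slide gets 0.246 by rounding intermediates)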

Reality check. Cool, but how about a real-world decision tree? They're actually pretty intuitive for humans, and we don't have to worry about choosing attributes: J48 does it for us! But J48 is vulnerable to overfitting.

Section 3.5: Pruning. Decision tree pruning is a technique used in machine learning that reduces the size of a decision tree by removing sections that provide little classification power. The goal is to reduce the complexity of the classifier and improve accuracy by reducing overfitting.

Section 3.5: Pruning. Most classifiers prune automatically after the tree is constructed. Example: the breast-cancer data set with the J48 classifier. Click on the highlighted box to see a list of options to run the classifier with.

Section 3.5: Pruning. Select the unpruned option.

Section 3.5: Pruning. The default, pruned J48 classifier is about 75% accurate.

Section 3.5: Pruning. Accuracy drops to about 69% when unpruned!

Section 3.5: Pruning. How do classifiers do their pruning? Typically, a classifier will stop splitting nodes once they get very small. Classifiers build the full tree and then work in from the leaves, using a statistical test to decide which leaves to prune. Interior nodes can also be pruned to move lower levels of the tree upward (this is the default behavior).
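J48's own pruning uses a statistical confidence estimate on the training data plus subtree raising, which is hard to show briefly. As a simplified stand-in, here is a sketch of reduced-error pruning, a different technique that works in from the leaves and keeps a leaf replacement only if accuracy on held-out rows does not drop; it assumes trees in the dict format of the build_tree skeleton above.

from collections import Counter

def classify(tree, row, default):
    # walk a build_tree()-style tree; fall back to default for unseen attribute values
    while isinstance(tree, dict):
        branch = tree["branches"].get(row[tree["attribute"]])
        if branch is None:
            return default
        tree = branch
    return tree

def accuracy(tree, rows, labels, default):
    if not rows:
        return 1.0
    hits = sum(classify(tree, r, default) == y for r, y in zip(rows, labels))
    return hits / len(rows)

def prune(tree, rows, labels):
    # work in from the leaves: replace a subtree with its majority-class leaf
    # whenever that does not hurt accuracy on the held-out rows
    if not isinstance(tree, dict) or not rows:
        return tree
    default = Counter(labels).most_common(1)[0][0]
    a = tree["attribute"]
    for value in list(tree["branches"]):
        keep = [(r, y) for r, y in zip(rows, labels) if r[a] == value]
        tree["branches"][value] = prune(tree["branches"][value],
                                        [r for r, _ in keep], [y for _, y in keep])
    leaf = default
    if accuracy(leaf, rows, labels, default) >= accuracy(tree, rows, labels, default):
        return leaf
    return tree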

Section 3.5: Pruning. Why pruning? Overfitting: the classifier works well on the training data, but the tree is too complex and does poorly on independent test data. Sometimes a decision tree is simply too complex, and simplifying it can improve accuracy as well as performance. Pruning is not restricted to trees; it can be applied to other data structures to improve performance.

Section 3.6: Rote learning, the simplest form of learning. To classify a new instance, search the training set for the one that is most like it. The instances themselves are the knowledge. This is lazy learning: do nothing until you have to make a prediction. No decision tree is built.

Section 3.6: Same class.

Section 3.6: Search the training set for the instance that is most like the new one; thus, we need a similarity function. Training data: young: 20, 22, 28, 30 (average 25); middle-age: 36, 38, 65, 56 (average 45); elderly: 75, 78, 95, 100 (average 85). Test data: 19, 40, 83.
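For numeric data like these ages, the similarity function can simply be the absolute difference. A minimal sketch of nearest-neighbour classification on the slide's numbers (illustrative only, not Weka's IBk):

training = {"young": [20, 22, 28, 30],
            "middle-age": [36, 38, 65, 56],
            "elderly": [75, 78, 95, 100]}

def nearest_neighbor(x):
    # return the class of the single training instance closest to x
    best_class, best_dist = None, float("inf")
    for label, ages in training.items():
        for age in ages:
            if abs(x - age) < best_dist:
                best_class, best_dist = label, abs(x - age)
    return best_class

for test in [19, 40, 83]:
    print(test, "->", nearest_neighbor(test))
# 19 -> young, 40 -> middle-age, 83 -> elderly

Comparing each test age to the class averages (25, 45, 85) gives the same assignments here.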

Section 3.6: What are noisy instances? They are incorrect instances in the training set.


Section 3.6: Advantages: accurate, because it makes predictions based on the current data. Disadvantages: slow, because it recalculates for each prediction.

Thank you! Any questions?