# Lecture 3: Decision Trees

## Transcription

### 1. Lecture 3: Decision Trees

Cognitive Systems - Machine Learning

Part I: Basic Approaches of Concept Learning

ID3, Information Gain, Overfitting, Pruning

last change November 26, 2014

Ute Schmid (CogSys, WIAI), ML Decision Trees, November 26, 2014, slide 1 / 35

### 2. Outline

- Decision Trees
- ID3
- Information Gain
- Overfitting
- Pruning
- Summary

### 3. Decision Tree Representation

- classification of instances by sorting them down the tree from the root to some leaf node
  - node: test of some attribute
  - branch: one of the possible values for the attribute
- decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances, i.e., $(\ldots \wedge \ldots) \vee (\ldots \wedge \ldots) \vee \ldots$
- equivalent to a set of if-then rules
  - each branch represents one if-then rule
  - if-part: conjunction of the attribute tests on the nodes
  - then-part: classification of the branch

### 4. Decision Tree Representation

This decision tree is equivalent to:

- if (Outlook = Sunny) $\wedge$ (Humidity = Normal) then Yes;
- if (Outlook = Overcast) then Yes;
- if (Outlook = Rain) $\wedge$ (Wind = Weak) then Yes;
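The three rules can be written directly as a function. This is a minimal sketch in Python; the function name `play_tennis` and the default answer No for every day not covered by a rule are assumptions made for illustration.

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> str:
    """Classify a day with the three if-then rules of the tree."""
    if outlook == "Sunny" and humidity == "Normal":
        return "Yes"
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Rain" and wind == "Weak":
        return "Yes"
    # Assumption: days matched by no rule are classified as No.
    return "No"


print(play_tennis("Sunny", "High", "Weak"))      # No
print(play_tennis("Overcast", "High", "Strong")) # Yes
```

Each `if` corresponds to one path from the root to a Yes leaf, so the disjunction of the three conditions is exactly the concept the tree represents.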

### 5. Appropriate Problems

- Instances are represented by attribute-value pairs, e.g. (Temperature, Hot)
- Target function has discrete output values, e.g. yes or no (concept/classification learning)
- Disjunctive descriptions may be required
- Training data may contain errors
- Training data may contain missing attribute values

The last three points make decision tree learning more attractive than CANDIDATE-ELIMINATION.

### 6. ID3

- learns decision trees by constructing them top-down
- employs a greedy search algorithm without backtracking through the space of all possible decision trees
- finds a short tree (with respect to path length), but not necessarily the best decision tree
- key idea:
  - selection of the next attribute according to a statistical measure
  - all examples are considered at the same time (simultaneous covering)
  - recursive application with reduction of selectable attributes until each training example can be classified unambiguously

### 7. ID3 Algorithm for Concept Learning

ID3(Examples, Target_attribute, Attributes)

- Create a Root for the tree
- If all examples are positive, Return single-node tree Root, with label = +
- If all examples are negative, Return single-node tree Root, with label = -
- If Attributes is empty, Return single-node tree Root, with label = most common value of Target_attribute in Examples
- Otherwise, Begin
  - $A \leftarrow$ attribute in Attributes that best classifies Examples
  - Decision attribute for Root $\leftarrow A$
  - For each possible value $v_i$ of $A$
    - Add new branch below Root with $A = v_i$
    - Let $Examples_{v_i}$ be the subset of Examples with $v_i$ for $A$
    - If $Examples_{v_i}$ is empty
      - Then add a leaf node with label = most common value of Target_attribute in Examples
      - Else add ID3($Examples_{v_i}$, Target_attribute, Attributes $- \{A\}$)
- Return Root

### 8. The Best Classifier

- central choice: Which attribute classifies the examples best?
- ID3 uses the information gain: a statistical measure that indicates how well a given attribute separates the training examples according to their target classification

$$Gain(S, A) \equiv \underbrace{Entropy(S)}_{\text{original entropy of } S} - \underbrace{\sum_{v \in values(A)} \frac{|S_v|}{|S|} \cdot Entropy(S_v)}_{\text{relative entropy of } S}$$

- interpretation: denotes the reduction in entropy caused by partitioning $S$ according to $A$
- alternative: number of saved yes/no questions (i.e., bits)
- the attribute with $\max_A Gain(S, A)$ is selected!

### 9. Entropy

- statistical measure from information theory that characterizes the (im-)purity of an arbitrary collection of examples $S$
- for binary classification: $H(S) \equiv -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}$
- for n-ary classification: $H(S) \equiv -\sum_{i=1}^{n} p_i \log_2 p_i$
- interpretation:
  - specification of the minimum number of bits of information needed to encode the classification of an arbitrary member of $S$
  - alternative: number of yes/no questions

### 10. Entropy

- minimum of $H(S)$:
  - for minimal impurity (point distribution)
  - $H(S) = 0$
- maximum of $H(S)$:
  - for maximal impurity (uniform distribution)
  - for binary classification: $H(S) = 1$
  - for n-ary classification: $H(S) = \log_2 n$
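The two extreme cases can be checked numerically. The following is a minimal Python sketch of the n-ary entropy formula; the function name `entropy` and the toy label lists are assumptions chosen for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i), with p_i the relative class frequencies."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


print(entropy(["+"] * 8))         # 0.0  (point distribution)
print(entropy(["+", "-"] * 4))    # 1.0  (uniform, binary)
print(entropy(list("abcd") * 2))  # 2.0  (uniform, n = 4, log2(4))
```

Classes that do not occur in `labels` are simply absent from the counter, which matches the usual convention that a term with $p_i = 0$ contributes nothing.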

### 11. Illustrative Example

Example days (feature Sky with values sunny, rainy is replaced by Outlook with three values):

| Day | Outlook | Temp. | Humidity | Wind | PlayTennis |
|---|---|---|---|---|---|
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Strong | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Weak | Yes |
| D11 | Sunny | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High | Strong | Yes |
| D13 | Overcast | Hot | Normal | Weak | Yes |
| D14 | Rain | Mild | High | Strong | No |

### 12. Illustrative Example

Entropy of $S$:

$$S = \{D1, \ldots, D14\} = [9+, 5-]$$

$$H(S) = -\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} \approx 0.940$$

Information gain (e.g. Wind):

$$S_{Weak} = \{D1, D3, D4, D5, D8, D9, D10, D13\} = [6+, 2-]$$

$$S_{Strong} = \{D2, D6, D7, D11, D12, D14\} = [3+, 3-]$$

$$Gain(S, Wind) = H(S) - \sum_{v \in Wind} \frac{|S_v|}{|S|} H(S_v) = H(S) - \frac{8}{14} H(S_{Weak}) - \frac{6}{14} H(S_{Strong}) \approx 0.940 - \frac{8}{14} \cdot 0.811 - \frac{6}{14} \cdot 1.0 \approx 0.048$$

### 13. Illustrative Example

Which attribute is the best classifier?

$$Gain(S, Humidity) \approx 0.940 - \frac{7}{14} \cdot 0.985 - \frac{7}{14} \cdot 0.592 \approx 0.152$$

$$Gain(S, Wind) \approx 0.940 - \frac{8}{14} \cdot 0.811 - \frac{6}{14} \cdot 1.0 \approx 0.048$$

### 14. Illustrative Example

- information gains for the four attributes:
  - $Gain(S, Outlook) \approx 0.247$
  - $Gain(S, Humidity) \approx 0.152$
  - $Gain(S, Wind) \approx 0.048$
  - $Gain(S, Temperature) \approx 0.029$
- Outlook is selected as best classifier and is therefore the Root of the tree
- now branches are created below the root for each possible value
  - because every example for which Outlook = Overcast is positive, this node becomes a leaf node with the classification Yes
  - the other descendants are still ambiguous ($H(S) \neq 0$)
  - hence, the decision tree has to be further elaborated below these nodes
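The four gains can be recomputed from the 14 example days. The sketch below is in Python; the names `DATA`, `entropy`, `gain` and `GAINS` are assumptions for illustration, and the values are rounded to three decimals when printed.

```python
from collections import Counter
from math import log2

NAMES = ["Outlook", "Temp", "Humidity", "Wind", "PlayTennis"]
ROWS = """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
DATA = [dict(zip(NAMES, line.split())) for line in ROWS.splitlines()]  # D1..D14


def entropy(examples, target="PlayTennis"):
    """Entropy of the target classification within a list of examples."""
    n = len(examples)
    counts = Counter(e[target] for e in examples)
    return sum(-(c / n) * log2(c / n) for c in counts.values())


def gain(examples, attribute):
    """Entropy of S minus the size-weighted entropy of the subsets S_v."""
    n = len(examples)
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder


GAINS = {a: gain(DATA, a) for a in NAMES[:-1]}
for a, g in sorted(GAINS.items(), key=lambda item: -item[1]):
    print(f"Gain(S, {a}) = {g:.3f}")
```

Sorting by descending gain puts Outlook first, which is why it becomes the root.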

### 15. Illustrative Example

Which attribute should be tested here?

$$S_{sunny} = \{D1, D2, D8, D9, D11\}$$

$$Gain(S_{sunny}, Humidity) \approx 0.971 - \frac{3}{5} \cdot 0.0 - \frac{2}{5} \cdot 0.0 = 0.971$$

$$Gain(S_{sunny}, Temperature) \approx 0.971 - \frac{2}{5} \cdot 0.0 - \frac{2}{5} \cdot 1.0 - \frac{1}{5} \cdot 0.0 = 0.571$$

$$Gain(S_{sunny}, Wind) \approx 0.971 - \frac{2}{5} \cdot 1.0 - \frac{3}{5} \cdot 0.918 \approx 0.020$$

### 16. Illustrative Example

[Figure: resulting decision tree]
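The recursive procedure can be sketched in a few lines of Python and run on the 14 example days. This is an illustrative sketch under stated assumptions, not the lecture's reference implementation: the nested-dict tree format and the names `DATA`, `DOMAINS`, `entropy`, `gain` and `id3` are hypothetical, and class labels are the strings Yes/No rather than +/-.

```python
from collections import Counter
from math import log2

NAMES = ["Outlook", "Temp", "Humidity", "Wind", "PlayTennis"]
ROWS = """\
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
DATA = [dict(zip(NAMES, line.split())) for line in ROWS.splitlines()]
ATTRIBUTES = NAMES[:-1]
# All possible values of each attribute, taken from the full example set.
DOMAINS = {a: sorted({e[a] for e in DATA}) for a in ATTRIBUTES}


def entropy(examples, target):
    n = len(examples)
    counts = Counter(e[target] for e in examples)
    return sum(-(c / n) * log2(c / n) for c in counts.values())


def gain(examples, attribute, target):
    n = len(examples)
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(examples, target) - remainder


def id3(examples, attributes, domains, target="PlayTennis"):
    """Return a leaf label or a nested dict {attribute: {value: subtree}}."""
    labels = [e[target] for e in examples]
    majority = Counter(labels).most_common(1)[0][0]
    # Leaf: all examples agree, or no attribute is left to test.
    if len(set(labels)) == 1 or not attributes:
        return majority
    best = max(attributes, key=lambda a: gain(examples, a, target))
    remaining = [a for a in attributes if a != best]
    tree = {best: {}}
    for value in domains[best]:
        subset = [e for e in examples if e[best] == value]
        # Empty subset: fall back to the most common label at this node.
        tree[best][value] = (id3(subset, remaining, domains, target)
                             if subset else majority)
    return tree


TREE = id3(DATA, ATTRIBUTES, DOMAINS)
print(TREE)
```

The printed tree tests Outlook at the root, Humidity below Sunny and Wind below Rain, with Overcast as a Yes leaf.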

### 17. Hypothesis Space Search

- $H$: complete space of finite discrete functions, relative to the available attributes (i.e. all possible decision trees)
- capabilities and limitations:
  - returns just one single consistent hypothesis
  - performs greedy search (i.e., $\max_A Gain(S, A)$)
  - susceptible to the usual risks of hill-climbing without backtracking
  - uses all training examples at each step (simultaneous covering)

### 18. Inductive Bias

- As mentioned above, ID3 searches a complete space of possible hypotheses (with respect to the instance space), but not completely: preference bias
- Inductive bias: Shorter trees are preferred to longer trees. Trees that place high information gain attributes close to the root are also preferred.
- Why prefer shorter hypotheses?
  - Occam's Razor: Prefer the simplest hypothesis that fits the data! (named after William of Ockham)
  - see Minimum Description Length Principle (Bayesian Learning)
  - e.g., if there are two decision trees, one with 500 nodes and another with 5 nodes, the second one should be preferred: better chance to avoid overfitting

### 19. Overfitting

Given a hypothesis space $H$, a hypothesis $h \in H$ is said to overfit the training data if there exists some alternative hypothesis $h' \in H$, such that $h$ has smaller error than $h'$ over the training data, but $h'$ has smaller error than $h$ over the entire distribution of instances.

### 20. Overfitting Example

| Day | Outlook | Temp. | Humidity | Wind | PlayTennis |
|---|---|---|---|---|---|
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Strong | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Weak | Yes |
| D11 | Sunny | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High | Strong | Yes |
| D13 | Overcast | Hot | Normal | Weak | Yes |
| D14 | Rain | Mild | High | Strong | No |
| D15 | Sunny | Hot | Normal | Strong | No |

- Wrong classification in the last example (noise)
- The resulting tree is more complex and has a different structure
- The tree still fits the training set, but classifies unseen examples wrongly

### 21. Overfitting

- reasons for overfitting:
  - noise in the data
  - number of training examples is too small to produce a representative sample of the target function
- how to avoid overfitting:
  - stop the tree growing earlier, before it reaches the point where it perfectly classifies the training data, i.e. create a leaf and assign the most common concept
  - allow overfitting and then post-prune the tree (more successful in practice!)
- how to determine the perfect tree size:
  - separate validation set to evaluate the utility of post-pruning
  - apply a statistical test to estimate whether expanding (or pruning) produces an improvement

### 22. Training Set, Validation Set and Test Set

- Training Set
  - used to form the learned hypothesis
- Validation Set
  - separated from the training set
  - used to evaluate the accuracy of the learned hypothesis over subsequent data
  - (in particular) used to evaluate the impact of pruning this hypothesis
- Test Set
  - separated from the training set
  - used only to evaluate the accuracy of the learned hypothesis over subsequent data
  - no more learning / adjustment of the parameters when applying the test set

### 23. Post-Pruning

Prune the tree after it has been generated to avoid overfitting. Two approaches:

1. Reduced Error Pruning
2. Rule Post-Pruning

### 24. Reduced Error Pruning

- each of the decision nodes is considered to be a candidate for pruning
- pruning a decision node consists of removing the sub-tree rooted at the node, making it a leaf node and assigning the most common classification of the training examples affiliated with that node
- nodes are removed only if the resulting tree performs no worse than the original tree over the validation set
- pruning starts with the node whose removal most increases accuracy and continues until further pruning is harmful
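The pruning loop can be sketched as follows in Python. Everything concrete here is an assumption for illustration: the node layout (`attr`, `majority`, `children`), the hand-built tree with a spurious Temp test below Sunny/Normal, and the six validation days are hypothetical and not taken from the lecture.

```python
import copy


def node(attr, majority, **children):
    """Decision node; majority is the most common training label at the node."""
    return {"attr": attr, "majority": majority, "children": children}


def predict(tree, x):
    while isinstance(tree, dict):
        tree = tree["children"].get(x[tree["attr"]], tree["majority"])
    return tree


def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)


def decision_nodes(tree, path=()):
    """Yield the branch-value path to every decision node (pruning candidate)."""
    if isinstance(tree, dict):
        yield path
        for value, child in tree["children"].items():
            yield from decision_nodes(child, path + (value,))


def pruned_at(tree, path):
    """Copy of tree in which the node at path is replaced by its majority leaf."""
    if not path:
        return tree["majority"]
    new = copy.deepcopy(tree)
    parent = new
    for value in path[:-1]:
        parent = parent["children"][value]
    parent["children"][path[-1]] = parent["children"][path[-1]]["majority"]
    return new


def reduced_error_pruning(tree, validation):
    while isinstance(tree, dict):
        candidates = [pruned_at(tree, p) for p in decision_nodes(tree)]
        best = max(candidates, key=lambda t: accuracy(t, validation))
        if accuracy(best, validation) < accuracy(tree, validation):
            break  # every further pruning step would be harmful
        tree = best
    return tree


def day(outlook, temp, humidity, wind):
    return {"Outlook": outlook, "Temp": temp, "Humidity": humidity, "Wind": wind}


TREE = node("Outlook", "Yes",
            Sunny=node("Humidity", "No", High="No",
                       Normal=node("Temp", "Yes", Hot="No", Mild="Yes", Cool="Yes")),
            Overcast="Yes",
            Rain=node("Wind", "Yes", Weak="Yes", Strong="No"))
VALIDATION = [(day("Sunny", "Hot", "Normal", "Weak"), "Yes"),
              (day("Sunny", "Hot", "High", "Weak"), "No"),
              (day("Rain", "Mild", "High", "Strong"), "No"),
              (day("Overcast", "Cool", "High", "Weak"), "Yes"),
              (day("Rain", "Cool", "Normal", "Weak"), "Yes"),
              (day("Sunny", "Mild", "High", "Strong"), "No")]
PRUNED = reduced_error_pruning(TREE, VALIDATION)
print(accuracy(TREE, VALIDATION), accuracy(PRUNED, VALIDATION))
```

In this run only the Temp node is removed: replacing it by its majority leaf raises validation accuracy, while pruning any of the remaining nodes would lower it. Ties between candidates are broken by traversal order, which is a choice of this sketch.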

### 25. Reduced Error Pruning

- effect of reduced error pruning: any node added due to coincidental regularities in the training set is likely to be pruned
- as nodes are pruned (fewer nodes), the fit to the test set improves, up to the point where further pruning becomes harmful
- the validation set used for pruning is distinct from both the training and test sets

### 26. Rule Post-Pruning

Rule post-pruning involves the following steps:

1. Infer the decision tree from the training set (overfitting allowed!)
2. Convert the tree into a set of rules
3. Prune each rule by removing any preconditions whose removal improves its estimated accuracy
4. Sort the pruned rules by their estimated accuracy

- one method to estimate rule accuracy is to use a separate validation set
- pruning rules is more precise than pruning the tree itself

### 27. Alternatives to Information Gain

- natural bias in information gain favors attributes with many values over those with few values
- e.g. attribute Date
  - very large number of values (e.g. March 21, 2005)
  - inserted in the above example, it would have the highest information gain, because it perfectly separates the training data
  - but the classification of unseen examples would be impossible
- alternative measure: GainRatio

$$GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}$$

$$SplitInformation(S, A) \equiv -\sum_{i=1}^{n} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$

- $SplitInformation(S, A)$ is sensitive to how broadly and uniformly $A$ splits $S$ (entropy of $S$ with respect to the values of $A$)
- GainRatio penalizes attributes such as Date
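The effect of the penalty can be checked on the example days. In the Python sketch below, the Day column (D1 to D14) stands in for a Date-like attribute with one value per example; that substitution and the names `DATA`, `entropy`, `gain` and `gain_ratio` are assumptions for illustration.

```python
from collections import Counter
from math import log2

NAMES = ["Day", "Outlook", "Temp", "Humidity", "Wind", "PlayTennis"]
ROWS = """\
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
"""
DATA = [dict(zip(NAMES, line.split())) for line in ROWS.splitlines()]


def entropy(values):
    """Entropy of the empirical distribution of a list of values."""
    n = len(values)
    return sum(-(c / n) * log2(c / n) for c in Counter(values).values())


def gain(examples, attribute, target="PlayTennis"):
    n = len(examples)
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        labels = [e[target] for e in examples if e[attribute] == value]
        remainder += len(labels) / n * entropy(labels)
    return entropy([e[target] for e in examples]) - remainder


def gain_ratio(examples, attribute):
    # SplitInformation is the entropy of S with respect to the values of A.
    # It is 0 if the attribute has a single value; that case is not handled here.
    split_information = entropy([e[attribute] for e in examples])
    return gain(examples, attribute) / split_information


for a in ("Day", "Outlook"):
    print(f"{a}: Gain = {gain(DATA, a):.3f}, GainRatio = {gain_ratio(DATA, a):.3f}")
```

On these 14 examples the ratio cuts the score of Day from about 0.940 to about 0.247, while Outlook drops from about 0.247 to about 0.156. The penalty narrows the advantage of the many-valued attribute considerably, though on such a small sample it does not remove it.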

### 28. Real-Valued Attributes

- Decision tree learning with real-valued attributes is possible
- Discretization of data by
  - split in two ranges (ID3)
  - estimating the number of best discriminating ranges (CAL5)
- One attribute must be allowed to occur more than one time on a path in the decision tree!
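A split into two ranges can be found by trying thresholds between neighbouring values and keeping the one with the highest information gain. The Python sketch below shows one common way to do this; the restriction to midpoints where the class changes, the function name `best_threshold`, and the six temperature readings are assumptions for illustration, not data from the lecture.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) of the best binary split value <= threshold."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    base = entropy(labels)
    best = (None, -1.0)
    for (a, label_a), (b, label_b) in zip(pairs, pairs[1:]):
        # Candidate thresholds: midpoints between neighbours of different class.
        if label_a == label_b or a == b:
            continue
        t = (a + b) / 2
        left = [label for v, label in pairs if v <= t]
        right = [label for v, label in pairs if v > t]
        g = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if g > best[1]:
            best = (t, g)
    return best


# Hypothetical temperature readings with their PlayTennis labels.
TEMPS = [40, 48, 60, 72, 80, 90]
PLAY = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(TEMPS, PLAY))
```

Because a threshold test leaves a whole range of values on each side, the same attribute can be tested again with a different threshold further down the path.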

### 29. General ML-Methods: Determining the Generalization Error

- Simple method: Randomly select part of the data from the training set for a test set
- Unbiased estimate of error, because the hypothesis is chosen independently of the test cases
- But: the estimated error may still vary from the true error!
- Estimate the confidence interval in which the true error lies with a certain probability (see Mitchell, chap. 5)

### 30. General ML-Methods: Cross Validation

- For many learning algorithms, it is useful to provide an additional validation set (e.g. for parameter fitting)
- k-fold cross validation:
  - partition the training set with $m$ examples into $k$ disjoint subsets of size $\frac{m}{k}$
  - run $k$ times, with a different subset as validation set each time (using the combined other subsets as training set)
  - calculate the mean of your estimates over the $k$ runs
  - last run with the complete training set and parameters fixed to the estimates
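The partitioning and averaging steps can be sketched in Python as follows. The helper names `k_fold_indices` and `cross_validate`, the fixed random seed, and the trivial majority-class learner used for the demonstration are assumptions for illustration.

```python
import random
from collections import Counter


def k_fold_indices(m, k, seed=0):
    """Yield (train, validation) index lists for k disjoint folds of m examples."""
    indices = list(range(m))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]


def cross_validate(examples, k, fit, error):
    """Mean validation error over the k runs."""
    errors = []
    for train, validation in k_fold_indices(len(examples), k):
        model = fit([examples[j] for j in train])
        errors.append(error(model, [examples[j] for j in validation]))
    return sum(errors) / k


# Hypothetical learner: always predict the most common training label.
def fit_majority(labels):
    return Counter(labels).most_common(1)[0][0]


def error_rate(model, labels):
    return sum(label != model for label in labels) / len(labels)


LABELS = ["Yes"] * 9 + ["No"] * 5
MEAN_ERROR = cross_validate(LABELS, 7, fit_majority, error_rate)
print(MEAN_ERROR)
```

When $m$ is not divisible by $k$, the slicing `indices[i::k]` produces folds whose sizes differ by at most one.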

### 31. Compare Learners

- Which learner obtains better results in some domain?
- Compare whether the generalization error of one is significantly lower than that of the other
- Use a procedure similar to k-fold cross-validation to obtain data for inferential statistical comparison (see Mitchell, chap. 5)

Source: Michael Wurtz eecs349michaelwurtz/

### 32. General ML-Methods: Bagging and Boosting

- Bagging (bootstrap aggregation, Breiman, 1996):
  - Calculate $M$ classifiers (e.g. decision trees) over different bootstrap samples
  - Prediction by majority vote
- Boosting (Freund & Schapire, 1996):
  - Additionally introduce weights for each classifier, which are iteratively adjusted (depending on classification failure/success)
- see diploma thesis of Jörg Mennicke: Classifier Learning for Imbalanced Data with Varying Misclassification Costs - A Comparison of knn, SVM, and Decision Tree Learning (2006)
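Bagging can be sketched in Python as follows. The base learner is a deliberately simple threshold stump rather than a decision tree, and the names `bagging`, `majority_vote`, `fit_stump` and the toy data are assumptions for illustration.

```python
import random
from collections import Counter


def bagging(examples, fit, M, seed=0):
    """Train M classifiers, each on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(examples)
    return [fit(rng.choices(examples, k=n)) for _ in range(M)]


def majority_vote(models, x):
    """Predict the class that most of the M classifiers vote for."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


# Hypothetical base learner: predict 1 if x lies above the sample mean.
def fit_stump(sample):
    threshold = sum(x for x, _ in sample) / len(sample)
    return lambda x: int(x > threshold)


EXAMPLES = [(x, int(x >= 5)) for x in range(10)]
MODELS = bagging(EXAMPLES, fit_stump, M=25)
print(majority_vote(MODELS, 0), majority_vote(MODELS, 9))
```

Each bootstrap sample has the same size as the original set, so some examples appear several times and others not at all; this is what makes the $M$ classifiers differ.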

### 33. The Problem of Imbalanced Data

- In realistic settings the occurrence of different classes might be imbalanced (e.g. cancer screening, quality control)
- Undersampling (remove examples of the over-represented class), oversampling (clone data of the under-represented class)
- The estimate is worse for the class which occurs more seldom; this might be the class with higher misclassification costs (e.g. decide no cancer if the true class is cancer)
- Instead of over-/undersampling, introduce different costs for misclassifications and calculate a weighted error measure!
- see e.g.: Tom Hecker and Jörg Mennicke, Diagnosing Cancerous Abnormalities with Decision Tree Learning, Student Project in cooperation with Fraunhofer IIS
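A cost-weighted error measure can be sketched in Python as follows. The cost values (a missed cancer counted ten times as heavily as a false alarm), the function name `weighted_error` and the toy predictions are hypothetical and chosen only to show the effect.

```python
def weighted_error(y_true, y_pred, cost):
    """Average misclassification cost; cost maps (true, predicted) to a weight."""
    total = sum(cost[(t, p)] for t, p in zip(y_true, y_pred) if t != p)
    return total / len(y_true)


# Hypothetical costs: missing a cancer is 10 times worse than a false alarm.
COST = {("cancer", "healthy"): 10.0, ("healthy", "cancer"): 1.0}

Y_TRUE = ["healthy"] * 8 + ["cancer"] * 2
PRED_A = ["healthy"] * 9 + ["cancer"]                    # misses one cancer
PRED_B = ["cancer"] * 2 + ["healthy"] * 6 + ["cancer"] * 2  # two false alarms

print(weighted_error(Y_TRUE, PRED_A, COST))  # 1.0
print(weighted_error(Y_TRUE, PRED_B, COST))  # 0.2
```

Classifier A makes fewer mistakes than B (one versus two), yet its weighted error is higher, because its single mistake falls on the rare, expensive class.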

### 34. Summary

- Decision tree learning is a practical and intuitively understandable method for concept learning
- Symbolic (rule-based) representation of hypotheses
- Able to learn disjunctive, discrete-valued concepts
- Noise in the data is allowed
- ID3 is a simultaneous covering algorithm based on information gain that performs a greedy top-down search through the space of possible decision trees
- The inductive bias of the DT algorithm is that short trees are preferred (Ockham's Razor)
- Overfitting is an important issue and can be reduced by pruning
- To estimate the generalization error of a hypothesis, typically k-fold cross validation is used

### 35. Learning Terminology: ID3

- Supervised Learning (vs. unsupervised learning)
- Approaches:
  - Concept / Classification (vs. Policy Learning)
  - symbolic (vs. statistical / neuronal network)
  - inductive (vs. analytical)
- Learning Strategy: learning from examples
- Data: categorial/metric features
- Target Values: concept/class

### Introduction. Decision Tree Learning. Outline. Decision Tree 9/7/2017. Decision Tree Definition

Introduction Decision Tree Learning Practical methods for inductive inference Approximating discrete-valued functions Robust to noisy data and capable of learning disjunctive expression ID3 earch a completely

### Decision Tree Learning

Decision Tree Learning Berlin Chen Department of Computer Science & Information Engineering National Taiwan Normal University References: 1. Machine Learning, Chapter 3 2. Data Mining: Concepts, Models,

### Chapter 3: Decision Tree Learning

Chapter 3: Decision Tree Learning CS 536: Machine Learning Littman (Wu, TA) Administration Books? New web page: http://www.cs.rutgers.edu/~mlittman/courses/ml03/ schedule lecture notes assignment info.

### Decision Tree Learning and Inductive Inference

Decision Tree Learning and Inductive Inference 1 Widely used method for inductive inference Inductive Inference Hypothesis: Any hypothesis found to approximate the target function well over a sufficiently

### Decision Tree Learning

0. Decision Tree Learning Based on Machine Learning, T. Mitchell, McGRAW Hill, 1997, ch. 3 Acknowledgement: The present slides are an adaptation of slides drawn by T. Mitchell PLAN 1. Concept learning:

### Administration. Chapter 3: Decision Tree Learning (part 2) Measuring Entropy. Entropy Function

Administration Chapter 3: Decision Tree Learning (part 2) Book on reserve in the math library. Questions? CS 536: Machine Learning Littman (Wu, TA) Measuring Entropy Entropy Function S is a sample of training

### Introduction to ML. Two examples of Learners: Naïve Bayesian Classifiers Decision Trees

Introduction to ML Two examples of Learners: Naïve Bayesian Classifiers Decision Trees Why Bayesian learning? Probabilistic learning: Calculate explicit probabilities for hypothesis, among the most practical

### Machine Learning & Data Mining

Group M L D Machine Learning M & Data Mining Chapter 7 Decision Trees Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University Top 10 Algorithm in DM #1: C4.5 #2: K-Means #3: SVM

### Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees!

Supervised Learning! Algorithm Implementations! Inferring Rudimentary Rules and Decision Trees! Summary! Input Knowledge representation! Preparing data for learning! Input: Concept, Instances, Attributes"

### Chapter 3: Decision Tree Learning (part 2)

Chapter 3: Decision Tree Learning (part 2) CS 536: Machine Learning Littman (Wu, TA) Administration Books? Two on reserve in the math library. icml-03: instructional Conference on Machine Learning mailing

### Decision Tree Learning

Topics Decision Tree Learning Sattiraju Prabhakar CS898O: DTL Wichita State University What are decision trees? How do we use them? New Learning Task ID3 Algorithm Weka Demo C4.5 Algorithm Weka Demo Implementation

### Decision Trees. Danushka Bollegala

Decision Trees Danushka Bollegala Rule-based Classifiers In rule-based learning, the idea is to learn a rule from train data in the form IF X THEN Y (or a combination of nested conditions) that explains

### Introduction Association Rule Mining Decision Trees Summary. SMLO12: Data Mining. Statistical Machine Learning Overview.

SMLO12: Data Mining Statistical Machine Learning Overview Douglas Aberdeen Canberra Node, RSISE Building Australian National University 4th November 2004 Outline 1 Introduction 2 Association Rule Mining

### the tree till a class assignment is reached

Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal

### Apprentissage automatique et fouille de données (part 2)

Apprentissage automatique et fouille de données (part 2) Telecom Saint-Etienne Elisa Fromont (basé sur les cours d Hendrik Blockeel et de Tom Mitchell) 1 Induction of decision trees : outline (adapted

### EECS 349:Machine Learning Bryan Pardo

EECS 349:Machine Learning Bryan Pardo Topic 2: Decision Trees (Includes content provided by: Russel & Norvig, D. Downie, P. Domingos) 1 General Learning Task There is a set of possible examples Each example

### CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING. Santiago Ontañón

CS 380: ARTIFICIAL INTELLIGENCE MACHINE LEARNING Santiago Ontañón so367@drexel.edu Summary so far: Rational Agents Problem Solving Systematic Search: Uninformed Informed Local Search Adversarial Search

### Tutorial 6. By:Aashmeet Kalra

Tutorial 6 By:Aashmeet Kalra AGENDA Candidate Elimination Algorithm Example Demo of Candidate Elimination Algorithm Decision Trees Example Demo of Decision Trees Concept and Concept Learning A Concept

### Decision Tree Learning

Decision Tree Learning Goals for the lecture you should understand the following concepts the decision tree representation the standard top-down approach to learning a tree Occam s razor entropy and information

### Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation Lecture Notes for Chapter 4 Part I Introduction to Data Mining by Tan, Steinbach, Kumar Adapted by Qiang Yang (2010) Tan,Steinbach,

### Induction on Decision Trees

Séance «IDT» de l'ue «apprentissage automatique» Bruno Bouzy bruno.bouzy@parisdescartes.fr www.mi.parisdescartes.fr/~bouzy Outline Induction task ID3 Entropy (disorder) minimization Noise Unknown attribute

### Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction

15-0: Learning vs. Deduction Artificial Intelligence Programming Bayesian Learning Chris Brooks Department of Computer Science University of San Francisco So far, we ve seen two types of reasoning: Deductive

### Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

### Decision Tree Learning

Decision Tree Learning Goals for the lecture you should understand the following concepts the decision tree representation the standard top-down approach to learning a tree Occam s razor entropy and information

### Classification and Regression Trees

Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

### Decision Trees. Lewis Fishgold. (Material in these slides adapted from Ray Mooney's slides on Decision Trees)

Decision Trees Lewis Fishgold (Material in these slides adapted from Ray Mooney's slides on Decision Trees) Classification using Decision Trees Nodes test features, there is one branch for each value of

### Concept Learning. Space of Versions of Concepts Learned

Concept Learning Space of Versions of Concepts Learned 1 A Concept Learning Task Target concept: Days on which Aldo enjoys his favorite water sport Example Sky AirTemp Humidity Wind Water Forecast EnjoySport

### Overview. Machine Learning, Chapter 2: Concept Learning

Overview Concept Learning Representation Inductive Learning Hypothesis Concept Learning as Search The Structure of the Hypothesis Space Find-S Algorithm Version Space List-Eliminate Algorithm Candidate-Elimination

### Decision Trees Entropy, Information Gain, Gain Ratio

Changelog: 14 Oct, 30 Oct Decision Trees Entropy, Information Gain, Gain Ratio Lecture 3: Part 2 Outline Entropy Information gain Gain ratio Marina Santini Acknowledgements Slides borrowed and adapted

### Classification: Rule Induction Information Retrieval and Data Mining. Prof. Matteo Matteucci

Classification: Rule Induction Information Retrieval and Data Mining Prof. Matteo Matteucci What is Rule Induction? The Weather Dataset 3 Outlook Temp Humidity Windy Play Sunny Hot High False No Sunny

### Bayesian Learning Features of Bayesian learning methods:

Bayesian Learning Features of Bayesian learning methods: Each observed training example can incrementally decrease or increase the estimated probability that a hypothesis is correct. This provides a more

### From inductive inference to machine learning

From inductive inference to machine learning ADAPTED FROM AIMA SLIDES Russel&Norvig:Artificial Intelligence: a modern approach AIMA: Inductive inference AIMA: Inductive inference 1 Outline Bayesian inferences

### Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,

### Supervised Learning (contd) Decision Trees. Mausam (based on slides by UW-AI faculty)

Supervised Learning (contd) Decision Trees Mausam (based on slides by UW-AI faculty) Decision Trees To play or not to play? http://www.sfgate.com/blogs/images/sfgate/sgreen/2007/09/05/2240773250x321.jpg

### Learning Decision Trees

Learning Decision Trees CS194-10 Fall 2011 Lecture 8 CS194-10 Fall 2011 Lecture 8 1 Outline Decision tree models Tree construction Tree pruning Continuous input features CS194-10 Fall 2011 Lecture 8 2

### Review of Lecture 1. Across records. Within records. Classification, Clustering, Outlier detection. Associations

Review of Lecture 1 This course is about finding novel actionable patterns in data. We can divide data mining algorithms (and the patterns they find) into five groups Across records Classification, Clustering,

### Modern Information Retrieval

Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

### Statistical Machine Learning from Data

Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Ensembles Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne

### ( D) I(2,3) I(4,0) I(3,2) weighted avg. of entropies

Decision Tree Induction using Information Gain Let I(x,y) as the entropy in a dataset with x number of class 1(i.e., play ) and y number of class (i.e., don t play outcomes. The entropy at the root, i.e.,

### CHAPTER-17. Decision Tree Induction

CHAPTER-17 Decision Tree Induction 17.1 Introduction 17.2 Attribute selection measure 17.3 Tree Pruning 17.4 Extracting Classification Rules from Decision Trees 17.5 Bayesian Classification 17.6 Bayes

### A Decision Stump. Decision Trees, cont. Boosting. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University. October 1 st, 2007

Decision Trees, cont. Boosting Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University October 1 st, 2007 1 A Decision Stump 2 1 The final tree 3 Basic Decision Tree Building Summarized

### Version Spaces.

. Machine Learning Version Spaces Prof. Dr. Martin Riedmiller AG Maschinelles Lernen und Natürlichsprachliche Systeme Institut für Informatik Technische Fakultät Albert-Ludwigs-Universität Freiburg riedmiller@informatik.uni-freiburg.de

### Stephen Scott.

1 / 28 ian ian Optimal (Adapted from Ethem Alpaydin and Tom Mitchell) Naïve Nets sscott@cse.unl.edu 2 / 28 ian Optimal Naïve Nets Might have reasons (domain information) to favor some hypotheses/predictions

### 16.4 Multiattribute Utility Functions

285 Normalized utilities The scale of utilities reaches from the best possible prize u to the worst possible catastrophe u Normalized utilities use a scale with u = 0 and u = 1 Utilities of intermediate

### Holdout and Cross-Validation Methods Overfitting Avoidance

Holdout and Cross-Validation Methods Overfitting Avoidance Decision Trees Reduce error pruning Cost-complexity pruning Neural Networks Early stopping Adjusting Regularizers via Cross-Validation Nearest

### Decision Trees (Cont.)

Decision Trees (Cont.) R&N Chapter 18.2,18.3 Side example with discrete (categorical) attributes: Predicting age (3 values: less than 30, 30-45, more than 45 yrs old) from census data. Attributes (split

### Bayesian Classification. Bayesian Classification: Why?

Bayesian Classification http://css.engineering.uiowa.edu/~comp/ Bayesian Classification: Why? Probabilistic learning: Computation of explicit probabilities for hypothesis, among the most practical approaches

### Learning from Observations. Chapter 18, Sections 1 3 1

Learning from Observations Chapter 18, Sections 1 3 Chapter 18, Sections 1 3 1 Outline Learning agents Inductive learning Decision tree learning Measuring learning performance Chapter 18, Sections 1 3

### Information Gain, Decision Trees and Boosting ML recitation 9 Feb 2006 by Jure

Information Gain, Decision Trees and Boosting 10-701 ML recitation 9 Feb 2006 by Jure Entropy and Information Grain Entropy & Bits You are watching a set of independent random sample of X X has 4 possible

### C4.5 - pruning decision trees

C4.5 - pruning decision trees Quiz 1 Quiz 1 Q: Is a tree with only pure leafs always the best classifier you can have? A: No. Quiz 1 Q: Is a tree with only pure leafs always the best classifier you can

### CC283 Intelligent Problem Solving 28/10/2013

Machine Learning What is the research agenda? How to measure success? How to learn? Machine Learning Overview Unsupervised Learning Supervised Learning Training Testing Unseen data Data Observed x 1 x

### Bias Correction in Classification Tree Construction ICML 2001

Bias Correction in Classification Tree Construction ICML 21 Alin Dobra Johannes Gehrke Department of Computer Science Cornell University December 15, 21 Classification Tree Construction Outlook Temp. Humidity

### Introduction to Artificial Intelligence. Learning from Oberservations

Introduction to Artificial Intelligence Learning from Oberservations Bernhard Beckert UNIVERSITÄT KOBLENZ-LANDAU Winter Term 2004/2005 B. Beckert: KI für IM p.1 Outline Learning agents Inductive learning

### Bias-Variance in Machine Learning

Bias-Variance in Machine Learning Bias-Variance: Outline Underfitting/overfitting: Why are complex hypotheses bad? Simple example of bias/variance Error as bias+variance for regression brief comments on

### Machine Learning 2D1431. Lecture 3: Concept Learning

Machine Learning 2D1431 Lecture 3: Concept Learning Question of the Day? What row of numbers comes next in this series? 1 11 21 1211 111221 312211 13112221 Outline Learning from examples General-to specific

### Final Overview. Introduction to ML. Marek Petrik 4/25/2017

Final Overview Introduction to ML Marek Petrik 4/25/2017 This Course: Introduction to Machine Learning Build a foundation for practice and research in ML Basic machine learning concepts: max likelihood,

### Decision Tree Algorithm Week 4

Decision Tree Algorithm Week 4 Team Homework Assignment #5 Read pp. 105-117 of the text book. Do Examples 3.1, 3.2, 3.3 and Exercise 3.4 (a). Prepare for the results of the homework assignment. Due date

### Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Decision Trees. Tobias Scheffer

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Decision Trees Tobias Scheffer Decision Trees One of many applications: credit risk Employed longer than 3 months Positive credit

### PATTERN CLASSIFICATION

PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

### GROUND RESISTANCE ESTIMATION USING INDUCTIVE MACHINE LEARNING

The 19th International Symposium on High Voltage Engineering, Pilsen, Czech Republic, August 23-28, 2015 GROUND RESISTANCE ESTIMATION USING INDUCTIVE MACHINE LEARNING V. P. Androvitsaneas *1, I. F. Gonos

### Moving Average Rules to Find. Confusion Matrix. CC283 Intelligent Problem Solving 05/11/2010. Edward Tsang (all rights reserved) 1

Machine Learning Overview Supervised Learning Training Testing Unseen data Data Observed x 1 x 2... x n 1.6 7.1... 2.7 1.4 6.8... 3.1 2.1 5.4... 2.8... Machine Learning Patterns y = f(x) Target y Buy

### Decision Trees. Robot Image Credit: Viktoriya Sukhanova 123RF.com

Decision Trees These slides were assembled by Eric Eaton, with grateful acknowledgement of the many others who made their course materials freely available online. Feel free to reuse or adapt these slides

### Chapter ML:III. III. Decision Trees. Decision Trees Basics Impurity Functions Decision Tree Algorithms Decision Tree Pruning

Chapter ML:III III. Decision Trees Decision Trees Basics Impurity Functions Decision Tree Algorithms Decision Tree Pruning ML:III-34 Decision Trees STEIN/LETTMANN 2005-2017 Splitting Let t be a leaf node

### LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES. Supervised Learning

LINEAR CLASSIFICATION, PERCEPTRON, LOGISTIC REGRESSION, SVC, NAÏVE BAYES Supervised Learning Linear vs non linear classifiers In K-NN we saw an example of a non-linear classifier: the decision boundary

### Machine Learning

Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University January 13, 2011 Today: The Big Picture Overfitting Review: probability Readings: Decision trees, overfitting

### Chapter 18. Decision Trees and Ensemble Learning. Recall: Learning Decision Trees

CSE 473 Chapter 18 Decision Trees and Ensemble Learning Recall: Learning Decision Trees Example: When should I wait for a table at a restaurant? Attributes (features) relevant to Wait? decision: 1. Alternate:

### Generative v. Discriminative classifiers Intuition

Logistic Regression (Continued) Generative v. Discriminative Decision Trees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31st, 2007 Generative

### Machine Learning. Computational Learning Theory. Eric Xing , Fall Lecture 9, October 5, 2016

Machine Learning 10-701, Fall 2016 Computational Learning Theory Eric Xing Lecture 9, October 5, 2016 Reading: Chap. 7 T.M book Eric Xing @ CMU, 2006-2016 1 Generalizability of Learning In machine learning

### Evaluation requires to define performance measures to be optimized

Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

### Numerical Learning Algorithms

Numerical Learning Algorithms Example SVM for Separable Examples; Example SVM for Nonseparable Examples; Example Gaussian Kernel SVM

### Regularization. CSCE 970 Lecture 3: Regularization. Stephen Scott and Vinod Variyam. Introduction. Outline

Other Measures 1 / 52 sscott@cse.unl.edu learning can generally be distilled to an optimization problem Choose a classifier (function, hypothesis) from a set of functions that minimizes an objective function

### Unsupervised Learning. k-means Algorithm

Unsupervised Learning Supervised Learning: Learn to predict y from x from examples of (x, y). Performance is measured by error rate. Unsupervised Learning: Learn a representation from exs. of x. Learn

### Midterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas

Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric

### Introduction to Machine Learning (67577) Lecture 5

Introduction to Machine Learning (67577) Lecture 5 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Nonuniform learning, MDL, SRM, Decision Trees, Nearest Neighbor Shai

### 1 Handling of Continuous Attributes in C4.5. Algorithm

Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar Data Mining: Classification/Supervised Learning Potpourri Contents 1. C4.5 and continuous attributes: incorporating continuous

### CS 6140: Machine Learning Spring What We Learned Last Week 2/26/16

Logistics CS 6140: Machine Learning Spring 2016 Instructor: Lu Wang College of Computer and Information Science Northeastern University Webpage: www.ccs.neu.edu/home/luwang Email: luwang@ccs.neu.edu Sign

### Data Mining. Preamble: Control Application. Industrial Researcher's Approach. Practitioner's Approach. Example. Example. Goal: Maintain T ~Td

Data Mining Andrew Kusiak 2139 Seamans Center Iowa City, Iowa 52242-1527 Preamble: Control Application Goal: Maintain T ~Td Tel: 319-335 5934 Fax: 319-335 5669 andrew-kusiak@uiowa.edu http://www.icaen.uiowa.edu/~ankusiak

### The AdaBoost algorithm

The AdaBoost algorithm 0) Set $W_i^{(0)} = 1/n$ for $i = 1, \ldots, n$ 1) At the $m$th iteration we find (any) classifier $h(x; \hat{\theta}_m)$ for which the weighted classification error $\epsilon_m = 0.5 - \frac{1}{2} \sum_{i=1}^{n} W_i^{(m-1)} y_i h(x_i; \hat{\theta}_m)$

### Multivariate Analysis Techniques in HEP

Multivariate Analysis Techniques in HEP Jan Therhaag IKTP Institutsseminar, Dresden, January 31 st 2013 Multivariate analysis in a nutshell Neural networks: Defeating the black box Boosted Decision Trees:

### ECE521 Lecture7. Logistic Regression

ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard

### Machine Learning (CS 567) Lecture 2

Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

### Bayesian Learning. Reading: Tom Mitchell, Generative and discriminative classifiers: Naive Bayes and logistic regression, Sections 1-2.

Bayesian Learning Reading: Tom Mitchell, Generative and discriminative classifiers: Naive Bayes and logistic regression, Sections 1-2. (Linked from class website) Conditional Probability Probability of

### Introduction to Machine Learning

Introduction to Machine Learning Concept Learning Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574 1 /

### Improving M5 Model Tree by Evolutionary Algorithm

Improving M5 Model Tree by Evolutionary Algorithm Master's Thesis in Computer Science Hieu Chi Huynh May 15, 2015 Halden, Norway www.hiof.no Abstract Decision trees are potentially powerful predictors,

### Artificial Intelligence Programming Probability

Artificial Intelligence Programming Probability Chris Brooks Department of Computer Science University of San Francisco 13-0: Uncertainty

### Classification Lecture 2: Methods

Classification Lecture 2: Methods Jing Gao SUNY Buffalo 1 Outline Basics Problem, goal, evaluation Methods Nearest Neighbor Decision Tree Naïve Bayes Rule-based Classification Logistic Regression Support

### Leveraging Randomness in Structure to Enable Efficient Distributed Data Analytics

Leveraging Randomness in Structure to Enable Efficient Distributed Data Analytics Jaideep Vaidya (jsvaidya@rbs.rutgers.edu) Joint work with Basit Shafiq, Wei Fan, Danish Mehmood, and David Lorenzi Distributed

### Midterm, Fall 2003

5-78 Midterm, Fall 2003 YOUR ANDREW USERID IN CAPITAL LETTERS: YOUR NAME: There are 9 questions. The ninth may be more time-consuming and is worth only three points, so do not attempt 9 unless you are

### The Naïve Bayes Classifier. Machine Learning Fall 2017

The Naïve Bayes Classifier Machine Learning Fall 2017 Today's lecture The naïve Bayes Classifier Learning the naïve Bayes Classifier Practical concerns

### Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture Regression: $X \in \mathbb{R}^p$, $Y \in \mathbb{R}$ (continuous output) Classification: $X \in \mathbb{R}^p$, $Y \in \{\Omega_0, \Omega_1, \ldots, \Omega_K\}$ (discrete output) X R p Y (X)

### Machine Learning and Data Mining. Decision Trees. Prof. Alexander Ihler

Machine Learning and Data Mining Decision Trees Prof. Alexander Ihler Decision trees Functional form f(x;µ): nested if-then-else statements Discrete features: fully expressive (any function) Structure:

### Linear Classifiers and the Perceptron

Linear Classifiers and the Perceptron William Cohen February 4, 2008 1 Linear classifiers Let's assume that every instance is an n-dimensional vector of real numbers $x \in \mathbb{R}^n$, and there are only two possible

### Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities ; CU- CS

University of Colorado, Boulder CU Scholar Computer Science Technical Reports Computer Science Spring 5-1-23 Probabilistic Random Forests: Predicting Data Point Specific Misclassification Probabilities

### EECS 349: Machine Learning

EECS 349: Machine Learning Bryan Pardo Topic 1: Concept Learning and Version Spaces (with some tweaks by Doug Downey) 1 Concept Learning Much of learning involves acquiring general concepts from specific

### CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning

CS 446 Machine Learning Fall 2016 Nov 01, 2016 Bayesian Learning Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes Overview Bayesian Learning Naive Bayes Logistic Regression Bayesian Learning So far, we

### Machine Learning Lecture 7

Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

### Machine Learning, Midterm Exam

10-601 Machine Learning, Midterm Exam Instructors: Tom Mitchell, Ziv Bar-Joseph Wednesday 12th December, 2012 There are 9 questions, for a total of 100 points. This exam has 20 pages, make sure you have