CS145: INTRODUCTION TO DATA MINING
4: Vector Data: Decision Tree
Instructor: Yizhou Sun, yzsun@cs.ucla.edu
October 10, 2017

Methods to Learn (by task and data type)
- Classification: Logistic Regression; Decision Tree; KNN; SVM; NN (Vector Data); Naïve Bayes for Text (Text Data)
- Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models (Vector Data); PLSA (Text Data)
- Prediction: Linear Regression; GLM* (Vector Data)
- Frequent Pattern Mining: Apriori; FP growth (Set Data); GSP; PrefixSpan (Sequence Data)
- Similarity Search: DTW (Sequence Data)

Vector Data: Trees
Outline: Tree-based Prediction and Classification; Classification Trees; Regression Trees*; Random Forest; Summary
(Current section: Tree-based Prediction and Classification)

Tree-based Models
Use trees to partition the data into different regions and make predictions.
[Figure: an example decision tree. The root node tests age (<=30, 31..40, >40); internal nodes test student? and credit rating?; leaf nodes carry the class labels yes/no.]

Easy to Interpret
A path from the root to a leaf node corresponds to a rule, e.g., if age <= 30 and student = no, then target value = no.
[Figure: the same decision tree on age, student, and credit rating.]

Vector Data: Trees
Outline: Tree-based Prediction and Classification; Classification Trees; Regression Trees*; Random Forest; Summary
(Current section: Classification Trees)

Decision Tree Induction: An Example
Training data set: Buys_xbox. The data set follows an example of Quinlan's ID3 (Playing Tennis).
age     income  student  credit_rating  buys_xbox
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no
Resulting tree: [Figure: the root tests age (<=30, 31..40, >40); the <=30 branch tests student? (no -> no, yes -> yes); the 31..40 branch is a leaf labeled yes; the >40 branch tests credit rating? (excellent -> no, fair -> yes).]

How to Choose Attributes?
[Figure: the class-label distributions obtained by splitting on Age (<=30, 31..40, >40) versus splitting on Credit_Rating (Excellent, Fair).]
Q: Which attribute is better for the classification task?

Brief Review of Entropy
Entropy (information theory): a measure of the uncertainty (impurity) associated with a random variable.
Calculation: for a discrete random variable Y taking m distinct values {y_1, ..., y_m},
H(Y) = - Σ_{i=1}^{m} p_i log(p_i), where p_i = P(Y = y_i).
Interpretation: higher entropy means higher uncertainty; lower entropy means lower uncertainty.
[Figure: entropy of a binary (m = 2) random variable as a function of p.]

Conditional Entropy
How much uncertainty about Y remains if we know an attribute X?
H(Y|X) = Σ_x p(x) H(Y | X = x)
This is the weighted average of the entropy at each branch.
[Figure: the Ages split (<=30, 31..40, >40) and the class labels falling into each branch.]
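To make the "weighted average of entropy at each branch" concrete, here is a minimal Python sketch (an illustration, not part of the slides) that computes H(Y) and H(Y | age) for the 14 buys_xbox labels from the training table shown earlier.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(Y) = -sum_i p_i * log2(p_i) over the label distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def conditional_entropy(attr_values, labels):
    """H(Y|X) = sum_x p(x) * H(Y | X = x): a weighted average over branches."""
    n = len(labels)
    h = 0.0
    for x in set(attr_values):
        branch = [y for v, y in zip(attr_values, labels) if v == x]
        h += (len(branch) / n) * entropy(branch)
    return h

# the age attribute and buys_xbox labels, in the row order of the training table
age  = ["<=30"]*2 + ["31..40"] + [">40"]*3 + ["31..40"] + ["<=30"]*2 + [">40"] + ["<=30"] + ["31..40"]*2 + [">40"]
buys = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"]

print(f"{entropy(buys):.3f}")                   # 0.940
print(f"{conditional_entropy(age, buys):.3f}")  # 0.694
```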

Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain.
Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
Expected information (entropy) needed to classify a tuple in D:
Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)
Information needed (after using A to split D into v partitions) to classify D (conditional entropy):
Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) × Info(D_j)
Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)

Attribute Selection: Information Gain
Class P: buys_xbox = yes (9 tuples); class N: buys_xbox = no (5 tuples).
Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
Splitting on age:
age      p_i  n_i  I(p_i, n_i)
<=30     2    3    0.971
31..40   4    0    0
>40      3    2    0.971
Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
Here (5/14) I(2,3) means that age <= 30 covers 5 of the 14 samples, with 2 yes'es and 3 no's.
Hence Gain(age) = Info(D) - Info_age(D) = 0.246.
Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048.
(Training data as in the table above.)
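As a cross-check of these numbers, the following Python sketch (not from the slides) recomputes the information gain of every attribute directly from the 14-record table; tiny differences in the last digit come from the slide rounding intermediate values.

```python
from collections import Counter
from math import log2

# the 14-record buys_xbox training set, row order as in the table above
data = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
attrs = ["age", "income", "student", "credit_rating"]

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr_idx):
    labels = [r[-1] for r in rows]
    n = len(rows)
    cond = 0.0  # conditional entropy: weighted average entropy over the branches
    for v in {r[attr_idx] for r in rows}:
        branch = [r[-1] for r in rows if r[attr_idx] == v]
        cond += (len(branch) / n) * entropy(branch)
    return entropy(labels) - cond

for i, a in enumerate(attrs):
    print(f"Gain({a}) = {info_gain(data, i):.3f}")
# Gain(age) ≈ 0.247, Gain(income) ≈ 0.029, Gain(student) ≈ 0.152, Gain(credit_rating) ≈ 0.048
# (the slide's rounding of intermediate values gives 0.246 and 0.151)
```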

Attribute Selection for a Branch
After splitting on age at the root, consider the branch age <= 30 and its subset D_{age<=30}:
age    income  student  credit_rating  buys_xbox
<=30   high    no       fair           no
<=30   high    no       excellent      no
<=30   medium  no       fair           no
<=30   low     yes      fair           yes
<=30   medium  yes      excellent      yes
Info(D_{age<=30}) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.971
Gain_{age<=30}(income) = Info(D_{age<=30}) - Info_income(D_{age<=30}) = 0.571
Gain_{age<=30}(student) = 0.971
Gain_{age<=30}(credit_rating) = 0.02
Which attribute next? Student has the highest gain, so the age <= 30 branch is split on student? (no -> no, yes -> yes).

Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
Conditions for stopping partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf.
- There are no samples left; use majority voting in the parent partition.
A minimal sketch of this recursion is given below.
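The sketch below outlines the greedy, top-down recursion under simple assumptions (categorical attributes; each row is a dict with a "label" key; function names are illustrative and not from any particular library).

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, attrs):
    """Pick the attribute with the highest information gain (the ID3 heuristic)."""
    labels = [r["label"] for r in rows]
    def gain(a):
        cond = 0.0
        for v in {r[a] for r in rows}:
            branch = [r["label"] for r in rows if r[a] == v]
            cond += (len(branch) / len(rows)) * entropy(branch)
        return entropy(labels) - cond
    return max(attrs, key=gain)

def build_tree(rows, attrs):
    """Top-down, recursive, divide-and-conquer induction (a sketch)."""
    labels = [r["label"] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    # stopping condition 1: all samples at this node belong to the same class
    if len(set(labels)) == 1:
        return labels[0]
    # stopping condition 2: no remaining attributes -> majority vote
    if not attrs:
        return majority
    a = best_attribute(rows, attrs)
    node = {"attribute": a, "branches": {}}
    for v in {r[a] for r in rows}:
        subset = [r for r in rows if r[a] == v]
        # stopping condition 3 (no samples left) cannot arise here, since we only
        # branch on attribute values that actually occur in the current partition
        node["branches"][v] = build_tree(subset, [x for x in attrs if x != a])
    return node
```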

Computing Information Gain for Continuous-Valued Attributes
Let attribute A be a continuous-valued attribute. We must determine the best split point for A:
- Sort the values of A in increasing order.
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (a_i + a_{i+1}) / 2 is the midpoint between the values a_i and a_{i+1}.
- The point with the minimum expected information requirement for A is selected as the split point for A.
Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point.
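A small sketch of this midpoint search (illustrative only; the toy values below are made up):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Try the midpoint between each pair of adjacent sorted values and keep the one
    with the minimum expected information (i.e., the maximum information gain)."""
    pairs = sorted(zip(values, labels))
    best, best_info = None, float("inf")
    for (a_i, _), (a_next, _) in zip(pairs, pairs[1:]):
        if a_i == a_next:
            continue
        split = (a_i + a_next) / 2
        left = [y for x, y in pairs if x <= split]
        right = [y for x, y in pairs if x > split]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if info < best_info:
            best, best_info = split, info
    return best, best_info

# hypothetical toy data: one numeric attribute with binary labels
print(best_split_point([25, 32, 41, 47, 52, 60], ["no", "no", "yes", "yes", "yes", "no"]))
# picks the midpoint 36.5 for this toy data
```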

Gain Ratio for Attribute Selection (C4.5)
The information gain measure is biased towards attributes with a large number of values. C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):
SplitInfo_A(D) = - Σ_{j=1}^{v} (|D_j| / |D|) × log2(|D_j| / |D|)
GainRatio(A) = Gain(A) / SplitInfo_A(D)
Ex.: gain_ratio(income) = 0.029 / 1.557 = 0.019
The attribute with the maximum gain ratio is selected as the splitting attribute.
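For instance, income splits the 14 records into partitions of sizes 4 (high), 6 (medium), and 4 (low), which is where the 1.557 above comes from; a short sketch:

```python
from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) = -sum_j (|D_j|/|D|) * log2(|D_j|/|D|)."""
    n = sum(partition_sizes)
    return -sum((s / n) * log2(s / n) for s in partition_sizes if s > 0)

si = split_info([4, 6, 4])     # partitions induced by income
print(round(si, 3))            # 1.557, matching the slide
print(round(0.029 / si, 3))    # gain ratio for income: 0.019
```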

Gini Index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, the gini index gini(D) is defined as
gini(D) = 1 - Σ_{j=1}^{n} p_j^2
where p_j is the relative frequency of class j in D.
If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
Reduction in impurity:
Δgini(A) = gini(D) - gini_A(D)
The attribute providing the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (all possible splitting points need to be enumerated for each attribute).

Computation of Gini Index
Ex.: D has 9 tuples with buys_xbox = yes and 5 with buys_xbox = no:
gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:
gini_{income ∈ {low, medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) = 0.443
Gini for the split {low, high} is 0.458 and for {medium, high} is 0.450; thus, split on {low, medium} (vs. {high}), since it has the lowest Gini index.
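The same numbers can be reproduced with a few lines of Python (a sketch; the class counts per income group are read off the 14-record training table):

```python
from collections import Counter

def gini(labels):
    """gini(D) = 1 - sum_j p_j^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Weighted Gini index of a binary split (CART)."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n) * gini(left_labels) + (len(right_labels) / n) * gini(right_labels)

# buys_xbox: 9 yes / 5 no overall; income in {low, medium}: 7 yes / 3 no; income in {high}: 2 yes / 2 no
overall = ["yes"] * 9 + ["no"] * 5
d1 = ["yes"] * 7 + ["no"] * 3   # {low, medium}
d2 = ["yes"] * 2 + ["no"] * 2   # {high}

print(round(gini(overall), 3))                        # 0.459
print(round(gini_split(d1, d2), 3))                   # 0.443
print(round(gini(overall) - gini_split(d1, d2), 3))   # reduction in impurity ≈ 0.016
```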

Comparing Attribute Selection Measures
The three measures, in general, return good results, but:
- Information gain: biased towards multivalued attributes.
- Gain ratio: tends to prefer unbalanced splits in which one partition is much smaller than the others (why?).
- Gini index: biased towards multivalued attributes.

*Other Attribute Selection Measures
- CHAID: a popular decision tree algorithm; its measure is based on the χ2 test for independence.
- C-SEP: performs better than information gain and the gini index in certain cases.
- G-statistic: has a close approximation to the χ2 distribution.
- MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred): the best tree is the one that requires the fewest number of bits to both (1) encode the tree and (2) encode the exceptions to the tree.
- Multivariate splits (partitions based on combinations of multiple variables): CART finds multivariate splits based on a linear combination of attributes.
Which attribute selection measure is the best? Most give good results; none is significantly superior to the others.

Overfitting and Tree Pruning
Overfitting: an induced tree may overfit the training data.
- Too many branches, some of which may reflect anomalies due to noise or outliers.
- Poor accuracy for unseen samples.
Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold. It is difficult to choose an appropriate threshold.
- Postpruning: remove branches from a fully grown tree to obtain a sequence of progressively pruned trees, then use a validation dataset to decide which is the best pruned tree.
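As one concrete way to carry out postpruning in practice, the sketch below uses scikit-learn's cost-complexity pruning (an assumption: scikit-learn is available, and its breast-cancer dataset merely stands in for real data); a held-out validation set picks the pruned tree, as described above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# grow a full tree, then obtain the sequence of progressively pruned trees
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = tree.score(X_val, y_val)   # validation accuracy decides the pruned tree
    if score >= best_score:            # ties go to the more heavily pruned tree
        best_alpha, best_score = alpha, score

print(f"chosen ccp_alpha={best_alpha:.4f}, validation accuracy={best_score:.3f}")
```

Prepruning corresponds instead to fixing limits such as max_depth or min_samples_split before the tree is grown.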

Vector Data: Trees
Outline: Tree-based Prediction and Classification; Classification Trees; Regression Trees*; Random Forest; Summary
(Current section: Regression Trees*)

From Classification to Prediction
- Target variable: from a categorical variable to a continuous variable.
- Attribute selection criterion: measure the purity of the continuous target variable in each partition.
- Leaf node: a simple model for that partition, e.g., the average.

Attribute Selection: Reduction of Variance
For attribute A, the weighted average variance is
Var_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) Var(D_j)
Var(D_j) = Σ_{y ∈ D_j} (y - ȳ)^2 / |D_j|, where ȳ = Σ_{y ∈ D_j} y / |D_j|
Pick the attribute with the lowest weighted average variance.
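A minimal sketch of this criterion (the split and the target values below are hypothetical):

```python
def variance(ys):
    """Var(D_j) = sum_y (y - mean)^2 / |D_j|: population variance of the targets."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def weighted_variance(partitions):
    """Var_A(D) = sum_j (|D_j|/|D|) * Var(D_j) over the partitions induced by attribute A."""
    n = sum(len(p) for p in partitions)
    return sum((len(p) / n) * variance(p) for p in partitions)

# hypothetical salaries split by a candidate attribute (e.g., Years <= 4.5 vs. Years > 4.5)
left  = [100, 120, 90, 110]    # targets in D1
right = [400, 500, 450]        # targets in D2
print(weighted_variance([left, right]))   # lower is better for this candidate split
```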

Leaf Node Model
Take the average of the partition for leaf node l:
ŷ_l = Σ_{y ∈ D_l} y / |D_l|

Example: Predict Baseball Player Salary
Dataset: (years, hits) => salary.
[Figure: scatter plot of the players; colors indicate the value of salary (blue: low, red: high).]

A Regression Tree Built
[Figure: the regression tree fitted to the baseball data, splitting on Years and Hits.]

A Different Angle to View the Tree
Each leaf corresponds to a box (region) in the plane:
R1: Years < 4.5
R2: Years > 4.5 & Hits < 117.5
R3: Years > 4.5 & Hits >= 117.5

Trees vs. Linear Models
[Figure: a comparison of fitted linear models and fitted trees on two problems, one whose ground-truth boundary is linear and one whose ground-truth boundary is non-linear.]

Vector Data: Trees
Outline: Tree-based Prediction and Classification; Classification Trees; Regression Trees*; Random Forest; Summary
(Current section: Random Forest)

A Single Tree or a Set of Trees?
Limitations of a single tree:
- Accuracy is often not very high.
- Prone to overfitting.
A set of trees: the idea of an ensemble.

The Idea of Bagging
Bagging: Bootstrap Aggregating. Train each base model on a bootstrap sample (drawn from the training data with replacement) and aggregate their predictions.
[Figure: the bagging workflow.]

Why Does It Work?
Each classifier i produces a prediction f_i(x). The error is reduced if we use the average of multiple classifiers: assuming the predictions are independent with equal variance,
Var( (Σ_i f_i(x)) / t ) = Var(f_i(x)) / t
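A quick numerical check of this variance argument (a sketch assuming NumPy is available; independence of the t predictions is the key assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
t = 50            # number of (assumed independent) classifiers
sigma2 = 1.0      # variance of each individual prediction f_i(x)

# simulate many repetitions of averaging t independent predictions
preds = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(100_000, t))
avg = preds.mean(axis=1)

print(preds[:, 0].var())   # ~ sigma^2 = 1.0 for a single classifier
print(avg.var())           # ~ sigma^2 / t = 0.02 for the average
```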

Random Forest
- Sample t data collections: random samples with replacement over the objects, each of size n' = n.
- Sample variables: select a random subset of the p variables for each data collection, e.g., p' = √p.
- Construct t trees, one for each data collection, using its selected subset of variables.
- Aggregate the prediction results for new data: majority voting for classification; averaging for prediction.
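A minimal scikit-learn sketch of this recipe (an assumption: scikit-learn is available; the breast-cancer dataset is only a stand-in). Note that scikit-learn draws the feature subset at every split rather than once per tree, a common variant of the scheme above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # t trees, one per bootstrap sample
    max_features="sqrt",   # subset of variables considered at each split (p' = sqrt(p))
    bootstrap=True,        # sample objects with replacement, n' = n
    random_state=0,
).fit(X_train, y_train)

print("test accuracy:", forest.score(X_test, y_test))   # majority voting over the trees
```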

Properties of Random Forest
Strengths:
- Good accuracy for classification tasks.
- Can handle large-scale datasets.
- Can handle missing data to some extent.
Weaknesses:
- Not as strong for (numeric) prediction tasks.
- Lacks interpretability.

Vector Data: Trees
Outline: Tree-based Prediction and Classification; Classification Trees; Regression Trees*; Random Forest; Summary
(Current section: Summary)

Summary
- Classification trees: predict categorical labels; information gain; tree construction.
- Regression trees*: predict a numerical variable; variance reduction.
- Random forest: a set of trees; bagging.