Predictive Modeling: Classification. KSE 521 Topic 6 Mun Yi

Agenda
- Models and Induction
- Entropy and Information Gain
- Tree-Based Classifier
- Probability Estimation

Introduction
Key concept of BI: predictive modeling.
Supervised segmentation: how can we segment the population with respect to something that we would like to predict or estimate?
- Which customers are likely to leave the company when their contracts expire?
- Which potential customers are likely not to pay off their account balances?
Technique: find or select important, informative variables/attributes of the entities with respect to a target.
- Are there one or more other variables that reduce our uncertainty about the value of the target?
- Select informative subsets in large databases.

Models and Induction
A model is a simplified representation of reality created to serve a purpose.
A predictive model is a formula for estimating the unknown value of interest: the target.
- Classification / class-probability estimation and regression models
- Prediction = estimating an unknown value
- Examples: credit scoring, spam filtering, fraud detection
Descriptive modeling: gain insight into the underlying phenomenon or process.

Finding Informative Attributes
Are there one or more other variables that reduce our uncertainty about the value of the target variable?

Person ID   Age   Gender   Income   Balance   Mortgage payment
123213      32    F        25000    32000     Y
17824       49    M        12000    -3000     N
232897      60    F        8000     1000      Y
288822      28    M        9000     3000      Y
...

Main Questions
- How can we judge whether a variable contains important information about the target variable?
- How can we (automatically) obtain a selection of the more informative variables with respect to predicting the value of the target variable?
- Even better, can we obtain the ranking of the variables?

Example: A Set of People to be Classified
Attributes:
- head-shape: square, circular
- body-shape: rectangular, oval
- body-color: black, white
Target variable: Yes, No

Selecting Informative Attributes
Which attribute is the most informative? Or the most useful for distinguishing between data instances?
If we split our data according to this variable, we would like the resulting groups to be as pure as possible.
By pure, we mean homogeneous with respect to the target variable. If every member of a group has the same value for the target, then the group is totally pure.

Example
If this is our entire dataset: (figure)
Then we can obtain two pure groups by splitting according to body shape: (figure)

Concerns
- Attributes rarely split a group perfectly. Even if one subgroup happens to be pure, the other may not be.
- Is a very small, pure group a good thing?
- How should continuous and categorical attributes be handled?

Entropy and Information Gain
Target variable has two (or more) categories: 1, 2, ..., m
- probability p1 for category 1
- probability p2 for category 2
- ...
Entropy: H(X) = -p1 log2 p1 - p2 log2 p2 - ... - pm log2 pm

Entropy
H(X) = -p1 log2 p1 - p2 log2 p2 - ... - pm log2 pm
- H(X) = -0.5 log2 0.5 - 0.5 log2 0.5 = 1
- H(X) = -0.75 log2 0.75 - 0.25 log2 0.25 ≈ 0.81
- H(X) = -1 log2 1 = 0
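To make the entropy formula concrete, here is a minimal R sketch (not part of the original slides); the helper name entropy and the example probability vectors are my own, and the calls reproduce the three values above.

# Entropy of a discrete target variable, given a vector of class probabilities.
entropy <- function(p) {
  p <- p[p > 0]          # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

entropy(c(0.5, 0.5))     # 1
entropy(c(0.75, 0.25))   # about 0.81
entropy(c(1, 0))         # 0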

Entropy (figure)

Information Gain
Calculation of information gain (IG):
IG(parent, children) = entropy(parent) - [p(c1) * entropy(c1) + p(c2) * entropy(c2) + ...]
where the parent node is split into children c1, c2, ... and p(ci) is the proportion of the parent's instances that fall into child ci.
(Diagram: a parent node split into children c1, c2, ...)
Note: higher IG indicates a more informative split by the variable.

Information Gain

person id   age>50   gender   residence   balance   mortgage payment
123213      N        F        own         52000     delayed
17824       Y        M        own         -3000     OK
232897      N        F        rent        70000     delayed
288822      Y        M        other       30000     delayed
...

Information Gain (figure; legend: delay, OK)

Information Gain
Entropy(parent) = -[p1 log2 p1 + p2 log2 p2] = -[0.53 * (-0.9) + 0.47 * (-1.1)] = 0.99 (very impure!)
Left child: entropy(balance < 50K) = -[p1 log2 p1 + p2 log2 p2] = -[0.92 * (-0.12) + 0.08 * (-3.7)] ≈ 0.39
Right child: entropy(balance ≥ 50K) = -[p1 log2 p1 + p2 log2 p2] = -[0.24 * (-2.1) + 0.76 * (-0.39)] ≈ 0.79

Information Gain
Entropy(parent) = 0.99
Left child: entropy(balance < 50K) = 0.39
Right child: entropy(balance ≥ 50K) = 0.79
IG for the split based on the balance variable:
IG = entropy(parent) - [p(balance < 50K) * entropy(balance < 50K) + p(balance ≥ 50K) * entropy(balance ≥ 50K)]
   = 0.99 - [0.43 * 0.39 + 0.57 * 0.79] = 0.37
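A small R check of this calculation, assuming the entropy() helper sketched earlier; the function name ig and the way class proportions and child weights are passed in are my own choices, with the numbers taken from the slides.

# Information gain of the balance split, reusing entropy() from above.
ig <- function(parent_p, children_p, weights) {
  entropy(parent_p) - sum(weights * sapply(children_p, entropy))
}

ig(parent_p   = c(0.53, 0.47),                 # all customers
   children_p = list(c(0.92, 0.08),            # balance <  50K
                     c(0.24, 0.76)),           # balance >= 50K
   weights    = c(0.43, 0.57))                 # share of instances per child
# about 0.37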

Information Gain
entropy(parent) = 0.99
entropy(residence = own) = 0.54
entropy(residence = rent) = 0.97
entropy(residence = other) = 0.98
IG = 0.13

So far
We have measures of:
- purity of the data (entropy)
- how informative a split by a variable is
We can identify and rank informative variables.
Next, we use this method to build our first supervised learning classifier: a decision tree.

Decision Trees
If we select multiple attributes that each give some information gain, it is not clear how to put them together; decision trees address this.
- The tree creates a segmentation of the data.
- Each internal node in the tree contains a test of an attribute.
- Each path eventually terminates at a leaf.
- Each leaf corresponds to a segment, and the attributes and values along the path give its characteristics.
- Each leaf contains a value for the target variable.
Decision trees are often used as predictive models.

How to build a decision tree (1/4)
(Diagram: EXPERT vs. INDUCTION routes from elementary rules / DWH sample to a generated decision tree; heuristic vs. enumerative.)
Manually build the tree based on expert knowledge:
- very time-consuming
- trees are sometimes corrupt (redundancy, contradictions, incompleteness, inefficiency)
Build the tree automatically by induction:
- recursively partition the instances based on their attributes (divide-and-conquer)
- easy to understand
- relatively efficient

How to build a decision tree (2/4)
Recursively apply attribute selection to find the best attribute to partition the data set.
The goal at each step is to select an attribute that partitions the current group into subgroups that are as pure as possible with respect to the target variable. (An R sketch of this recursive procedure follows after slide 4/4 below.)

How to build a decision tree (3/4) (figure)

How to build a decision tree (4/4) (figure)
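As an illustration of the divide-and-conquer idea from slide 2/4, here is a bare-bones R sketch of recursive tree induction for categorical attributes; it reuses the entropy() helper above, and the function name build_tree and its interface are my own. It is a sketch of the principle only, not a substitute for a production learner such as rpart.

# Bare-bones divide-and-conquer tree induction for categorical attributes.
# "target" names the class column; "attrs" are the candidate attribute names.
build_tree <- function(data, target, attrs) {
  counts <- table(data[[target]])
  # Stop when the node is pure or no attributes remain; predict the majority class.
  if (length(attrs) == 0 || length(counts[counts > 0]) == 1)
    return(list(leaf = TRUE, label = names(which.max(counts))))
  # Information gain of splitting this node on attribute a.
  gain <- function(a) {
    parts <- split(data, data[[a]], drop = TRUE)
    w <- sapply(parts, nrow) / nrow(data)
    child_h <- sapply(parts, function(p) entropy(prop.table(table(p[[target]]))))
    entropy(prop.table(counts)) - sum(w * child_h)
  }
  best <- attrs[which.max(sapply(attrs, gain))]
  children <- lapply(split(data, data[[best]], drop = TRUE), build_tree,
                     target = target, attrs = setdiff(attrs, best))
  list(leaf = FALSE, attribute = best, children = children)
}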

Dataset

person id   age>50   gender   residence   balance>=50,000   mortgage payment
123213      N        F        own         N                 delayed
17824       Y        M        own         Y                 OK
232897      N        F        rent        N                 delayed
288822      Y        M        other       N                 delayed
...

Based on this dataset we will build a tree-based classifier.
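For experimentation, the four example rows shown above can be typed in as an R data frame (a stand-in only; the real dataset continues beyond these rows, and the column names below are my own spellings of the slide's headers).

# The four example rows from the slide as an R data frame.
mortgage <- data.frame(
  person_id   = c(123213, 17824, 232897, 288822),
  age_over_50 = c("N", "Y", "N", "Y"),
  gender      = c("F", "M", "F", "M"),
  residence   = c("own", "own", "rent", "other"),
  balance_50k = c("N", "Y", "N", "N"),
  payment     = c("delayed", "OK", "delayed", "delayed")
)
# With the full dataset one could, for example, call
# build_tree(mortgage, "payment", c("age_over_50", "gender", "residence", "balance_50k"))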

Tree Structure
All customers
  Balance >= 50,000
    Residence = own -> OK
    Residence = rent -> OK
    Residence = other -> Delay
  Balance < 50,000
    Age >= 50 -> OK
    Age < 50 -> Delay

Tree Structure
All customers (14 Delay, 16 OK)

Tree Structure
All customers (14 Delay, 16 OK)
(Bar chart: information gain of the candidate attributes balance, residence, gender, age, cust id; axis 0 to 0.5.)

Tree Structure
All customers (14 Delay, 16 OK)
  Balance >= 50,000 (4 delay, 12 OK)
  Balance < 50,000 (2 OK, 12 delay)

Tree Structure
All customers (14 Delay, 16 OK)
  Balance >= 50,000 (4 delay, 12 OK)
  Balance < 50,000 (2 OK, 12 delay)
    Age >= 50 -> OK (1 delay, 2 OK)
    Age < 50 -> Delay (11 delay, 0 OK)

Tree Structure
All customers (14 Delay, 16 OK)
  Balance >= 50,000 (4 delay, 12 OK)
  Balance < 50,000 (2 OK, 12 delay)
    Age >= 50 -> OK (1 delay, 2 OK)
    Age < 50 -> Delay (11 delay, 0 OK)
(Bar chart: information gain of the remaining candidate attributes residence, gender, age, cust id; axis 0 to 0.2.)

Tree Structure
All customers (14 Delay, 16 OK)
  Balance >= 50,000 (4 delay, 12 OK)
    Residence = own -> OK (0 delay, 5 OK)
    Residence = rent -> OK (1 delay, 5 OK)
    Residence = other -> Delay (3 delay, 2 OK)
  Balance < 50,000 (2 OK, 12 delay)
    Age >= 50 -> OK (1 delay, 2 OK)
    Age < 50 -> Delay (11 delay, 0 OK)

Tree Structure
All customers
  Balance >= 50,000
    Residence = own -> OK
    Residence = rent -> OK
    Residence = other -> Delay
  Balance < 50,000
    Age >= 50 -> OK
    Age < 50 -> Delay

New customer to classify:

id      Age>50   Gender   Residence   Balance>=50K   Delay
87594   Y        F        own         <50K           ???
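Reading the prediction off the tree above can be written as a few nested conditions in R; the function classify() and its argument names are my own, and the final call applies the path for customer 87594 (balance < 50K, age > 50, residence = own).

# Following the tree from the slide by hand.
classify <- function(balance_50k, age_over_50, residence) {
  if (balance_50k == "Y") {
    if (residence %in% c("own", "rent")) "OK" else "Delay"
  } else {
    if (age_over_50 == "Y") "OK" else "Delay"
  }
}
classify(balance_50k = "N", age_over_50 = "Y", residence = "own")   # "OK"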

Information gain for numeric attributes
- "Discretize" numeric attributes by split points.
- How to choose the split points that provide the highest information gain?
Segmentation for regression problems
- Information gain is not the right measure.
- We need a measure of purity for numeric values.
- Look at reduction of VARIANCE.
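A sketch of the split-point search in R (the names and structure are my own): candidate split points are taken midway between consecutive observed values, and the one with the largest impurity reduction is kept. With the mean-squared impurity used as the default this is reduction of variance (regression); for a class target one could pass an entropy-based impurity instead.

# Choose the split point on a numeric attribute x that most reduces the
# weighted impurity of the target y. For a class target, one could pass
# impurity = function(v) entropy(prop.table(table(v))) instead.
best_split <- function(x, y, impurity = function(v) mean((v - mean(v))^2)) {
  su <- sort(unique(x))
  candidates <- head(su, -1) + diff(su) / 2      # midpoints between observed values
  reduction <- sapply(candidates, function(s) {
    left <- y[x <= s]; right <- y[x > s]
    w <- c(length(left), length(right)) / length(y)
    impurity(y) - sum(w * c(impurity(left), impurity(right)))
  })
  candidates[which.max(reduction)]
}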

Open Issues
All customers (14 Delay, 16 OK)
  Balance >= 50,000 (4 delay, 12 OK)
    Residence = own -> OK (0 delay, 5 OK)
    Residence = rent -> OK (1 delay, 5 OK)
    Residence = other -> Delay (3 delay, 2 OK)
  Balance < 50,000 (2 OK, 12 delay)
    Age >= 50 -> OK (1 delay, 2 OK)
    Age < 50 -> Delay (11 delay, 0 OK)

Probability Estimation (1/3)
We often need a more informative prediction than just a classification.
- E.g. allocate your budget to the instances with the highest expected loss.
- Supports a more sophisticated decision-making process.
Classification may oversimplify the problem.
- E.g. if all segments have a probability of < 0.5 for write-off, every leaf will be labeled "not write-off".
We would like each segment (leaf) to be assigned an estimate of the probability of membership in the different classes: a probability estimation tree.

Probability Estimation (2/3)
Tree induction can easily produce probability estimation trees instead of simple classification trees: the instance counts at each leaf provide class probability estimates.
Frequency-based estimate of class membership: if a leaf contains n positive and m negative instances, the probability of any new instance being positive may be estimated as n / (n + m).
This approach may be too optimistic for segments with a very small number of instances (overfitting).
Smoothed version of the frequency-based estimate via the Laplace correction, which moderates the influence of leaves with only a few instances:
p(c) = (n + 1) / (n + m + 2)
where n is the number of instances at the leaf belonging to class c and m is the number of instances not belonging to class c.

Probability Estimation (3/3)
Effect of the Laplace correction on several class ratios as the number of instances increases (2/3, 3/4, 4/5).
Example: a leaf of the classification tree that has 2 positive instances and no negative instances would produce the same frequency-based estimate (p = 1) as a leaf with 20 positives and no negatives.
The Laplace correction smooths the estimate of the first leaf down to p = 0.75 to reflect this uncertainty, but it has much less effect on the leaf with 20 instances (p ≈ 0.95).
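The two estimates are easy to compare in R; the helper names below are my own, and the calls reproduce the 0.75 and roughly 0.95 figures from the slide.

# Frequency-based vs. Laplace-corrected estimates for a leaf with n instances
# of class c and m instances of the other classes.
freq_estimate    <- function(n, m) n / (n + m)
laplace_estimate <- function(n, m) (n + 1) / (n + m + 2)

freq_estimate(2, 0);  laplace_estimate(2, 0)     # 1 vs. 0.75
freq_estimate(20, 0); laplace_estimate(20, 0)    # 1 vs. about 0.95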

Example: The Churn Problem (1/3)
Solve the churn problem by tree induction.
- Historical data set of 20,000 customers.
- Each customer either had stayed with the company or left.
- Customers are described by the following variables: (table omitted)
We want to use this data to predict which new customers are going to churn.

The Churn Problem (2/3)
How good is each of these variables individually?
- Measure the information gain of each variable.
- Compute information gain for each variable independently.

The Churn Problem (3/3)
The highest information gain feature (HOUSE) is at the root of the tree.
- Why is the order of features chosen for the tree different from the ranking?
- When to stop building the tree?
- How do we know that this is a good model?

When to Stop Building the Tree
Tree pruning identifies and removes subtrees within a decision tree that are likely to be due to noise and sample variance in the training set.
Pre-pruning: the tree is pruned by stopping its construction early, e.g. by specifying a threshold for
- the number of instances per node
- the information gain
- the depth of the tree
Post-pruning: the tree is pruned after the tree induction algorithm is allowed to grow the tree to completion.
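In rpart these ideas map onto the stopping parameters and the complexity-parameter (cp) table; the sketch below uses illustrative thresholds only (the specific values are not from the slides).

# Pre-pruning via rpart's stopping parameters, post-pruning via the cp table.
library(rpart)
fit <- rpart(Species ~ ., data = iris,
             control = rpart.control(minsplit = 20,   # min. instances to attempt a split
                                     maxdepth = 4,    # max. depth of the tree
                                     cp = 0.01))      # min. complexity improvement per split
printcp(fit)                           # cross-validated error for each subtree size
fit.pruned <- prune(fit, cp = 0.05)    # post-prune back to a simpler subtree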

Decision Trees with R
R also allows for much finer control of the decision tree construction. The script below demonstrates how to create a simple tree for the iris data set using a training set of 75 records:

> library(rpart)
> iris.train <- sample(1:150, 75)
> iris.dtree <- rpart(Species ~ ., data = iris, subset = iris.train)
> library(rattle)
> drawTreeNodes(iris.dtree)
> table(predict(iris.dtree, iris[-iris.train, ], type = "class"),
        iris[-iris.train, "Species"])
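The final table() call cross-tabulates the predicted and the actual species for the 75 records held out of training, i.e. it prints the confusion matrix used to judge the tree on unseen data.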