Predictive Modeling: Classification
Topic 6
Mun Yi
Agenda
- Models and Induction
- Entropy and Information Gain
- Tree-Based Classifier
- Probability Estimation
Introduction
A key concept of BI is predictive modeling.
Supervised segmentation: how can we segment the population with respect to something that we would like to predict or estimate?
- "Which customers are likely to leave the company when their contracts expire?"
- "Which potential customers are likely not to pay off their account balances?"
Technique: find or select important, informative variables/attributes of the entities with respect to a target.
- Is there one or more other variables that reduce our uncertainty about the value of the target?
- Select informative subsets in large databases.
Models and Induction
A model is a simplified representation of reality created to serve a purpose.
A predictive model is a formula for estimating the unknown value of interest: the target.
- Classification/class-probability estimation and regression models
- Prediction = estimate an unknown value
- Credit scoring, spam filtering, fraud detection
Descriptive modeling: gain insight into the underlying phenomenon or process.
Finding Informative Attributes
Is there one or more other variables that reduce our uncertainty about the value of the target variable?

person ID | age | gender | income | balance | mortgage payment
123213    | 32  | F      | 25000  |  32000  | Y
17824     | 49  | M      | 12000  |  -3000  | N
232897    | 60  | F      |  8000  |   1000  | Y
288822    | 28  | M      |  9000  |   3000  | Y
...
Main Questions How can we judge whether a variable contains important information about the target variable? How can we (automatically) obtain a selection of the more informative variables with respect to predicting the value of the target variable? Even better, can we obtain the ranking of the variables? 6
Example - A Set of People to be Classified
Attributes:
- head-shape: square, circular
- body-shape: rectangular, oval
- body-color: black, white
Target variable: Yes, No
Selecting Informative Attributes Which attribute is the most informative? Or the most useful for distinguishing between data instances? If we split our data according to this variable, we would like the resulting groups to be as pure as possible. By pure, we mean homogeneous with respect to the target variable. If every member of a group has the same value for the target, then the group is totally pure. 8
Example If this is our entire dataset: Then, we can obtain two pure groups by splitting according to body shape: 9
Concerns
- Attributes rarely split a group perfectly. Even if one subgroup happens to be pure, the other may not be.
- Is a very small, pure group a good thing?
- How should continuous and categorical attributes be handled?
Entropy and Information Gain
Target variable has two (or more) categories: 1, 2, ..., m
- probability p1 for category 1
- probability p2 for category 2, ...
Entropy:
H(X) = -p1 log2 p1 - p2 log2 p2 - ... - pm log2 pm
Entropy
H(X) = -p1 log2 p1 - p2 log2 p2 - ... - pm log2 pm
H(X) = -0.5 log2 0.5 - 0.5 log2 0.5 = 1
H(X) = -0.75 log2 0.75 - 0.25 log2 0.25 ≈ 0.81
H(X) = -1 log2 1 = 0
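The entropy values above are easy to verify in a few lines of Python (an illustrative sketch, not part of the slides):

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with p == 0 are treated as 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # maximal uncertainty: 1.0
print(entropy([0.75, 0.25]))  # ~0.81
print(entropy([1.0]))         # a pure group: zero entropy
```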
Entropy 13
Information Gain
Calculation of information gain (IG):
IG(parent, children) = entropy(parent) - [p(c1) × entropy(c1) + p(c2) × entropy(c2) + ...]
[Diagram: parent node split into children c1, c2, ...]
Note: higher IG indicates a more informative split by the variable.
Information Gain

person ID | age>50 | gender | residence | balance | mortgage payment
123213    | N      | F      | own       |  52000  | delayed
17824     | Y      | M      | own       |  -3000  | OK
232897    | N      | F      | rent      |  70000  | delayed
288822    | Y      | M      | other     |  30000  | delayed
...
Information Gain
[Figure: customers split by the balance attribute; legend: delay, OK]
Information Gain
Entropy(parent) = -[p(delay) log2 p(delay) + p(OK) log2 p(OK)]
= -[0.53 × (-0.9) + 0.47 × (-1.1)] = 0.99 (very impure!)
Left child: entropy(balance < 50K) = -[0.92 × (-0.12) + 0.08 × (-3.7)] = 0.39
Right child: entropy(balance ≥ 50K) = -[0.24 × (-2.1) + 0.76 × (-0.39)] = 0.79
Information Gain
Entropy(parent) = 0.99
Left child: entropy(balance < 50K) = 0.39
Right child: entropy(balance ≥ 50K) = 0.79
IG for the split based on the balance variable:
IG = entropy(parent) - [p(balance < 50K) × entropy(balance < 50K) + p(balance ≥ 50K) × entropy(balance ≥ 50K)]
= 0.99 - [0.43 × 0.39 + 0.57 × 0.79] = 0.37
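The same calculation can be reproduced from raw class counts. The counts below are inferred from the proportions on the slide (16 delay / 14 OK overall; 12/1 in the left child; 4/13 in the right) and are an assumption, not given explicitly:

```python
import math

def entropy(counts):
    """Entropy in bits from a tuple of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

# (delay, OK) counts inferred from the slide's proportions:
parent = (16, 14)   # 30 customers: p ≈ 0.53 / 0.47
left   = (12, 1)    # balance <  50K: p ≈ 0.92 / 0.08
right  = (4, 13)    # balance >= 50K: p ≈ 0.24 / 0.76

n = sum(parent)
ig = entropy(parent) - (sum(left) / n * entropy(left)
                        + sum(right) / n * entropy(right))
print(round(ig, 2))  # ~0.38; the slide's 0.37 comes from rounded intermediate entropies
```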
Information Gain
entropy(parent) = 0.99
entropy(residence = own) = 0.54
entropy(residence = rent) = 0.97
entropy(residence = other) = 0.98
IG = 0.13
So far
We have measures of:
- purity of the data (entropy)
- how informative a split by a variable is
We can identify and rank informative variables.
Next we use this method to build our first supervised learning classifier: a decision tree.
Decision Trees
If we select multiple attributes, each giving some information gain, it is not clear how to put them together → decision trees
- The tree creates a segmentation of the data
- Each node in the tree contains a test of an attribute
- Each path eventually terminates at a leaf
- Each leaf corresponds to a segment, and the attributes and values along the path give its characteristics
- Each leaf contains a value for the target variable
Decision trees are often used as predictive models.
How to build a decision tree (1/4)
[Diagram: a decision tree and its elementary rules, derived either heuristically by an expert or enumeratively by induction from a DWH sample]
Manually build the tree based on expert knowledge
- very time-consuming
- trees are sometimes corrupt (redundancy, contradictions, incompleteness, inefficiency)
Build the tree automatically by induction
- recursively partition the instances based on their attributes (divide-and-conquer)
- easy to understand
- relatively efficient
How to build a decision tree (2/4) Recursively apply attribute selection to find the best attribute to partition the data set The goal at each step is to select an attribute to partition the current group into subgroups that are as pure as possible w.r.t. the target variable 23
How to build a decision tree (3/4) 24
How to build a decision tree (4/4) 25
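The recursive divide-and-conquer procedure described above can be sketched in Python. This is a minimal illustration, not the slides' algorithm verbatim; the toy rows, attribute names, and dict-based representation are all made up:

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Information gain of splitting the current group on attribute attr."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())

def build_tree(rows, labels, attrs):
    """Recursively pick the highest-IG attribute, partition, and recurse
    until a group is pure or no attributes remain (leaf = majority class)."""
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    children = {}
    for value in {row[best] for row in rows}:
        keep = [i for i, row in enumerate(rows) if row[best] == value]
        children[value] = build_tree([rows[i] for i in keep],
                                     [labels[i] for i in keep],
                                     attrs - {best})
    return {"attr": best, "children": children}

# Hypothetical toy data in the spirit of the mortgage example:
rows = [{"age>50": "Y", "gender": "F"},
        {"age>50": "Y", "gender": "M"},
        {"age>50": "N", "gender": "F"},
        {"age>50": "N", "gender": "M"}]
labels = ["OK", "OK", "delayed", "delayed"]
tree = build_tree(rows, labels, {"age>50", "gender"})
print(tree)  # root splits on age>50, which separates the classes perfectly
```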
Dataset

person ID | age>50 | gender | residence | balance ≥ 50,000 | payment
123213    | N      | F      | own       | N                | delayed
17824     | Y      | M      | own       | Y                | OK
232897    | N      | F      | rent      | N                | delayed
288822    | Y      | M      | other     | N                | delayed
...

Based on this dataset we will build a tree-based classifier.
Tree Structure
All customers
  Balance ≥ 50,000
    Residence = Own → OK
    Residence = Rent → OK
    Residence = Other → Delay
  Balance < 50,000
    Age ≥ 50 → OK
    Age < 50 → Delay
Tree Structure All customers (14 Delay,16 OK) 28
Tree Structure
All customers (14 delay, 16 OK)
[Bar chart: information gain (0 to 0.5) for the attributes balance, residence, gender, age, cust id]
Tree Structure
All customers (14 delay, 16 OK)
  Balance ≥ 50,000 (4 delay, 12 OK)
  Balance < 50,000 (12 delay, 2 OK)
Tree Structure
All customers (14 delay, 16 OK)
  Balance ≥ 50,000 (4 delay, 12 OK)
  Balance < 50,000 (12 delay, 2 OK)
    Age ≥ 50 → OK (1 delay, 2 OK)
    Age < 50 → Delay (11 delay, 0 OK)
Tree Structure
All customers (14 delay, 16 OK)
  Balance ≥ 50,000 (4 delay, 12 OK)
  Balance < 50,000 (12 delay, 2 OK)
    Age ≥ 50 → OK (1 delay, 2 OK)
    Age < 50 → Delay (11 delay, 0 OK)
[Bar chart: information gain (0 to 0.2) for the attributes residence, gender, age, cust id]
Tree Structure
All customers (14 delay, 16 OK)
  Balance ≥ 50,000 (4 delay, 12 OK)
    Residence = Own → OK (0 delay, 5 OK)
    Residence = Rent → OK (1 delay, 5 OK)
    Residence = Other → Delay (3 delay, 2 OK)
  Balance < 50,000 (12 delay, 2 OK)
    Age ≥ 50 → OK (1 delay, 2 OK)
    Age < 50 → Delay (11 delay, 0 OK)
Tree Structure
All customers
  Balance ≥ 50,000
    Residence = Own → OK
    Residence = Rent → OK
    Residence = Other → Delay
  Balance < 50,000
    Age ≥ 50 → OK
    Age < 50 → Delay

New instance to classify:
id    | age>50 | gender | residence | balance | delay
87594 | Y      | F      | own       | <50K    | ???
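Classifying the new customer means walking the finished tree from the root. A sketch in Python, with the tree encoded as nested dicts (the attribute names are made up for illustration):

```python
# The finished tree from the slides, as nested dicts; a leaf is a plain string.
tree = {
    "attr": "balance_ge_50k",
    "children": {
        True:  {"attr": "residence",
                "children": {"own": "OK", "rent": "OK", "other": "Delay"}},
        False: {"attr": "age_gt_50",
                "children": {True: "OK", False: "Delay"}},
    },
}

def classify(tree, instance):
    """Walk down the tree, following the branch matching each attribute test,
    until a leaf (a plain string) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"][instance[tree["attr"]]]
    return tree

# Customer 87594: age > 50, residence = own, balance < 50K.
new_customer = {"balance_ge_50k": False, "age_gt_50": True, "residence": "own"}
print(classify(tree, new_customer))  # OK  (balance < 50K branch, then age >= 50)
```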
Information gain for numeric attributes
- "Discretize" numeric attributes by split points
- How to choose the split points that provide the highest information gain?
Segmentation for regression problems
- Information gain is not the right measure; we need a measure of purity for numeric values
- Look at reduction of VARIANCE
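For regression, the analogue of information gain is the reduction in variance from parent to children. A minimal sketch with made-up income values (not from the slides):

```python
def variance(values):
    """Population variance of a list of numeric target values."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def variance_reduction(parent, children):
    """Variance of the parent minus the weighted variance of the children;
    the best split is the one that maximizes this reduction."""
    n = len(parent)
    return variance(parent) - sum(len(c) / n * variance(c) for c in children)

# Hypothetical target values: two clearly separated income groups.
incomes = [10, 12, 11, 40, 42, 41]
left, right = incomes[:3], incomes[3:]
print(variance_reduction(incomes, [left, right]))  # 225.0: a very good split
```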
Open Issues
All customers (14 delay, 16 OK)
  Balance ≥ 50,000 (4 delay, 12 OK)
    Residence = Own → OK (0 delay, 5 OK)
    Residence = Rent → OK (1 delay, 5 OK)
    Residence = Other → Delay (3 delay, 2 OK)
  Balance < 50,000 (12 delay, 2 OK)
    Age ≥ 50 → OK (1 delay, 2 OK)
    Age < 50 → Delay (11 delay, 0 OK)
Probability Estimation (1/3)
We often need a more informative prediction than just a classification
- e.g. allocate your budget to the instances with the highest expected loss
- more sophisticated decision-making process
Classification may oversimplify the problem
- e.g. if all segments have a probability of <0.5 for write-off, every leaf will be labeled "not write-off"
We would like each segment (leaf) to be assigned an estimate of the probability of membership in the different classes
→ probability estimation tree
Probability Estimation (2/3)
Tree induction can easily produce probability estimation trees instead of simple classification trees
Instance counts at each leaf provide class probability estimates
Frequency-based estimate of class membership: if a leaf contains n positive and m negative instances, the probability of any new instance being positive may be estimated as n/(n+m)
This approach may be too optimistic for segments with a very small number of instances (overfitting)
Smoothed version of the frequency-based estimate: the Laplace correction moderates the influence of leaves with only a few instances:
p(c) = (n + 1) / (n + m + 2)
with n the number of instances that belong to class c and m the number of instances not belonging to class c
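The Laplace correction is a one-liner; a sketch reproducing the slides' numbers:

```python
def laplace_estimate(n, m):
    """Smoothed probability that a new instance at a leaf belongs to class c:
    n instances of class c, m instances of other classes -> (n+1)/(n+m+2)."""
    return (n + 1) / (n + m + 2)

print(laplace_estimate(2, 0))   # 0.75   (raw frequency-based estimate: 2/2 = 1.0)
print(laplace_estimate(20, 0))  # ~0.95  (large leaf: the correction barely matters)
```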
Probability Estimation (3/3)
Effect of the Laplace correction on several class ratios as the number of instances increases (2/3, 3/4, 4/5).
Example: a leaf of the classification tree that has 2 positive instances and no negative instances would produce the same frequency-based estimate (2/2 = 1) as a leaf with 20 positives and no negatives. The Laplace correction smooths the estimate of the first leaf down to (2+1)/(2+0+2) = 0.75 to reflect this uncertainty, but it has much less effect on the leaf with 20 instances ((20+1)/(20+0+2) ≈ 0.95).
Example - The Churn Problem (1/3) Solve the churn problem by tree induction Historical data set of 20,000 customers Each customer either had stayed with the company or left Customers are described by the following variables: We want to use this data to predict which new customers are going to churn. 41
The Churn Problem (2/3) How good are each of these variables individually? Measure the information gain of each variable Compute information gain for each variable independently 42
The Churn Problem (3/3) The highest information gain feature (HOUSE) is at the root of the tree. Why is the order of features chosen for the tree different from the ranking? When to stop building the tree? How do we know that this is a good model? 43
When to Stop Building the Tree
Tree pruning identifies and removes subtrees within a decision tree that are likely to be due to noise and sample variance in the training set.
Pre-pruning: the tree is pruned by stopping its construction early, e.g. by specifying a threshold for
- the number of instances per node
- information gain
- depth of the tree
Post-pruning: the tree is pruned after the induction algorithm has been allowed to grow it to completion.
Decision Trees with R
R also allows for much finer control of decision tree construction. The script below demonstrates how to create a simple tree for the Iris data set using a training set of 75 records:
> library(rpart)
> iris.train <- c(sample(1:150, 75))
> iris.dtree <- rpart(Species ~ ., data=iris, subset=iris.train)
> library(rattle)
> drawTreeNodes(iris.dtree)
> table(predict(iris.dtree, iris[-iris.train,], type="class"),
        iris[-iris.train, "Species"])