Statistisches Data Mining (StDM), Week 11. Oliver Dürr, Institut für Datenanalyse und Prozessdesign, Zürcher Hochschule für Angewandte Wissenschaften, oliver.duerr@zhaw.ch. Winterthur, 29 November 2016
Multitasking lowers learning efficiency: no laptops during the theory lectures. Lids closed or almost closed (sleep mode).
Overview of classification (until the end of the semester)

Classifiers: K-Nearest-Neighbors (KNN), Logistic Regression, Linear Discriminant Analysis, Support Vector Machine (SVM), Classification Trees, Neural Networks (NN), Deep Neural Networks (e.g. CNN, RNN)
Evaluation: Cross-validation, Performance measures, ROC Analysis / Lift Charts
Theoretical Guidance / General Ideas: Bayes Classifier, Bias-Variance Trade-off (Overfitting)
Combining classifiers: Bagging, Boosting, Random Forest
Feature Engineering: Feature Extraction, Feature Selection
Decision Trees (Chapter 8.1 in ISLR)
Note on ISLR: ISLR also covers trees for regression; here we focus on trees for classification.
Example of a Decision Tree for classification

Training Data:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model (decision tree): the splitting attribute at the root is Refund. If Refund = Yes, predict NO. If Refund = No, split on MarSt: if Married, predict NO; if Single or Divorced, split on TaxInc: if < 80K predict NO, if > 80K predict YES.

Source: Tan, Steinbach, Kumar
Another Example of a Decision Tree

Same training data as on the previous slide (Tid 1-10 with attributes Refund, Marital Status, Taxable Income and class Cheat).

Alternative model: split on MarSt first. If Married, predict NO. If Single or Divorced, split on Refund: if Refund = Yes, predict NO; if Refund = No, split on TaxInc: if < 80K predict NO, if > 80K predict YES.

There can be more than one tree that fits the same data!

Source: Tan, Steinbach, Kumar
Decision Tree Classification Task

Training Set:

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

Workflow: a tree induction algorithm learns a model (the decision tree) from the training set (induction); the model is then applied to the test set (deduction).

Source: Tan, Steinbach, Kumar
Apply Model to Test Data

Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?

Start from the root of the tree and follow the splits: Refund = No leads to the MarSt node; MarSt = Married leads directly to a leaf (the TaxInc split is not reached).

Result: assign Cheat = No.

Source: Tan, Steinbach, Kumar
Decision Tree Classification Task (recap): the same training and test sets as before; the decision tree learned from the training set (induction) is applied to the unlabeled test set (deduction).

Source: Tan, Steinbach, Kumar
Restriction to recursive partitioning by binary splits. There are many tree-building approaches (e.g. C4.5 / C5.0, SPRINT). Here: top-down splitting until no further split is possible (or another stopping criterion is met).
How to train a classification tree. Score: node impurity (next slides). Exercise: draw 3 splits that separate the data, and draw the corresponding tree. [Figure: scatter plot of two classes in the (x1, x2) plane, with reference values 0.25 on the x1 axis and 0.4 on the x2 axis.]
Construction of a classification tree: minimize the impurity in each node.

A parent node p is split into 2 partitions. Maximize the gain over all possible splits:

$\mathrm{GAIN}_{split} = \mathrm{Impurity}(p) - \sum_{i=1}^{2} \frac{n_i}{n}\,\mathrm{Impurity}(i)$

where $n_i$ is the number of records in partition $i$ and $n$ is the number of records in the parent node.

Possible impurity measures: Gini index, entropy, misclassification error.

Source: Tan, Steinbach, Kumar
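To make the formula concrete, here is a minimal R sketch (added here, not part of the original slides) that evaluates the Gini-based gain of one candidate binary split on the toy "Cheat" data; the helper names giniImpurity and splitGain are invented for illustration.

# Added sketch: impurity gain of one binary split (Gini-based)
giniImpurity <- function(labels) {
  p <- table(labels) / length(labels)   # relative class frequencies p(j|t)
  1 - sum(p^2)
}
splitGain <- function(labels, left_idx) {
  n  <- length(labels)
  nl <- sum(left_idx)                   # records going to the left partition
  nr <- n - nl
  giniImpurity(labels) -
    (nl / n) * giniImpurity(labels[left_idx]) -
    (nr / n) * giniImpurity(labels[!left_idx])
}
# Example: the candidate split "Taxable Income < 80K" on the toy data
cheat  <- c("No","No","No","No","Yes","No","No","Yes","No","Yes")
income <- c(125, 100, 70, 120, 95, 60, 220, 85, 75, 90)
splitGain(cheat, income < 80)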
Construction of a classification tree: minimize the impurity in each node. We have n_c = 4 different classes: red, green, blue, olive. [Figures: the class distribution before the split and after a possible split, shown for two candidate splits.]
The three most common impurity measures

p(j|t) is the relative frequency of class j at node t; n_c is the number of different classes.

Gini index: $\mathrm{GINI}(t) = 1 - \sum_{j=1}^{n_c} [p(j\mid t)]^2$

Entropy: $\mathrm{Entropy}(t) = -\sum_{j=1}^{n_c} p(j\mid t)\,\log_2 p(j\mid t)$

Classification error: $\mathrm{Error}(t) = 1 - \max_{j \in \{1,\dots,n_c\}} p(j\mid t)$
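As an added illustration (not part of the original slides), the three measures can be written as small R functions operating on the vector of class counts at a node:

# Added sketch: the three impurity measures, computed from class counts at a node
gini    <- function(counts) { p <- counts / sum(counts); 1 - sum(p^2) }
entropy <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]                      # convention: 0 * log2(0) = 0
  -sum(p * log2(p))
}
classError <- function(counts) { p <- counts / sum(counts); 1 - max(p) }

# Node with 1 record of class C1 and 5 of class C2 (see the worked examples below)
gini(c(1, 5)); entropy(c(1, 5)); classError(c(1, 5))
# approximately 0.278, 0.650, 0.167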
Computing the Gini index for three 2-class examples

Gini index at a given node t (in the case of n_c = 2 classes):

$\mathrm{GINI}(t) = 1 - \sum_{j=1}^{2} [p(j\mid t)]^2 = 1 - \big(p(\mathrm{C1}\mid t)^2 + p(\mathrm{C2}\mid t)^2\big)$

Node with C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 - 0 - 1 = 0
Node with C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6; Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
Node with C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6; Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444

Source: Tan, Steinbach, Kumar
Computing the entropy for three examples

Entropy at a given node t (in the case of 2 classes):

$\mathrm{Entropy}(t) = -\big(p(\mathrm{C1}\mid t)\,\log_2 p(\mathrm{C1}\mid t) + p(\mathrm{C2}\mid t)\,\log_2 p(\mathrm{C2}\mid t)\big)$

Node with C1 = 0, C2 = 6: p(C1) = 0, p(C2) = 1; Entropy = -0 log2(0) - 1 log2(1) = 0 (using the convention 0 log2(0) = 0)
Node with C1 = 1, C2 = 5: p(C1) = 1/6, p(C2) = 5/6; Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
Node with C1 = 2, C2 = 4: p(C1) = 2/6, p(C2) = 4/6; Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92

Source: Tan, Steinbach, Kumar
Computing the misclassification error for three examples

Classification error at a given node t (in the case of 2 classes):

$\mathrm{Error}(t) = 1 - \max\big(p(\mathrm{C1}\mid t),\, p(\mathrm{C2}\mid t)\big)$

Node with C1 = 0, C2 = 6: p(C1) = 0, p(C2) = 1; Error = 1 - max(0, 1) = 1 - 1 = 0
Node with C1 = 1, C2 = 5: p(C1) = 1/6, p(C2) = 5/6; Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6
Node with C1 = 2, C2 = 4: p(C1) = 2/6, p(C2) = 4/6; Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3

Source: Tan, Steinbach, Kumar
Compare the three impurity measures for a 2-class problem. [Figure: impurity as a function of p, the proportion of class 1, for the Gini index, entropy, and misclassification error.] Source: Tan, Steinbach, Kumar
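The comparison figure can be reproduced with a few lines of R (an added sketch, not part of the original slides):

# Added sketch: the three impurity measures as a function of p = proportion of class 1
p <- seq(0.001, 0.999, length.out = 200)
gini  <- 1 - p^2 - (1 - p)^2
entr  <- -(p * log2(p) + (1 - p) * log2(1 - p))
error <- 1 - pmax(p, 1 - p)
plot(p, entr, type = "l", ylim = c(0, 1),
     xlab = "p (proportion of class 1)", ylab = "Impurity")
lines(p, gini,  lty = 2)
lines(p, error, lty = 3)
legend("topright", legend = c("Entropy", "Gini", "Misclassification error"), lty = 1:3)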
Using a tree for classification problems. Score: node impurity. Model: the class label of a region is given by the majority of all observation labels in that region. [Figure: a tree with splits x1 < 0.25 and x2 < 0.4, together with the corresponding rectangular partition of the (x1, x2) feature space; a new observation is assigned the majority label of the region it falls into.]
Decision Tree in R

# rpart implements recursive partitioning
library(rpart)
d <- rpart(Species ~ ., data = iris)      # fit a classification tree for Species
plot(d, margin = 0.2, uniform = TRUE)     # draw the tree
text(d, use.n = TRUE)                     # label nodes with class counts
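The fitted tree can also be used for prediction; the following lines are an added illustration (the "new" data are simply the first rows of iris):

# Added illustration: predict classes for new observations
predict(d, newdata = iris[1:5, ], type = "class")   # predicted species
predict(d, newdata = iris[1:5, ], type = "prob")    # class probabilities per leaf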
Overfitting: the larger the tree, the lower the error on the training set (low bias). However, the test error can get worse (high variance). Figure taken from ISLR.
Pruning

First strategy: stop growing the tree when the gain in the impurity measure is small. This is too short-sighted: a seemingly worthless split early in the tree might be followed by a very good split.

A better strategy is to grow a very large tree and then prune it back in order to obtain a subtree.

In rpart, pruning is controlled with a complexity parameter cp. cp is a tuning parameter of the tree which can be optimized (like k in kNN). Later: cross-validation as a way to systematically find the best meta-parameter (see the sketch after the next slide).
Pruning

Original tree:
library(MASS)                                  # provides the crabs data set
library(rpart)
t1 <- rpart(sex ~ ., data = crabs)
plot(t1, margin = 0.2, uniform = TRUE); text(t1, use.n = TRUE)

Pruned trees:
t2 <- prune(t1, cp = 0.05)
plot(t2); text(t2, use.n = TRUE)               # cp = 0.05
t3 <- prune(t1, cp = 0.1)
plot(t3); text(t3, use.n = TRUE)               # cp = 0.1
t4 <- prune(t1, cp = 0.2)
plot(t4); text(t4, use.n = TRUE)               # cp = 0.2
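A minimal sketch (added here, not from the slides) of how cp can be chosen systematically: rpart already performs an internal cross-validation and stores the results in the cptable of the fitted object.

# Added sketch: pick cp from rpart's built-in cross-validation results
printcp(t1)                                    # cp table: rel. error and xerror per cp
best_cp <- t1$cptable[which.min(t1$cptable[, "xerror"]), "CP"]
t_best  <- prune(t1, cp = best_cp)             # prune to the cp with lowest CV error
plot(t_best, margin = 0.2, uniform = TRUE); text(t_best, use.n = TRUE)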
Trees vs Linear Models Slide Credit: ISLR
Pros and cons of tree models

Pros
- Interpretability
- Robust to outliers in the input variables
- Can capture non-linear structures
- Can capture local interactions very well
- Low bias if appropriate input variables are available and the tree has sufficient depth

Cons
- High variance (instability of trees)
- Tend to overfit
- Need big data sets to capture additive structures
- Inefficient for capturing linear structures
Joining Forces
Which supervised learning method is best? Image: Seni & Elder
Bundling improves performance. Image: Seni & Elder
Evolving results I: low-hanging fruit and slowed-down progress. After 3 weeks, at least 40 teams had improved on the Netflix classifier. Top teams showed about 5% improvement. However, improvement then slowed down. From http://www.research.att.com/~volinsky/netflix/
Details: Gravity. Quote from the team's paper [home.mit.bme.hu/~gtakacs/download/gravity.pdf]
Details: When Gravity and Dinosaurs Unite. Quote: "Our common team blends the result of team Gravity and team Dinosaur Planet."
Details: BellKor / KorBell. Quote: "Our final solution (RMSE = 0.8712) consists of blending 107 individual results."
Evolving results III: final results. The winner was an ensemble of ensembles (including BellKor); gradient boosted decision trees were used for the blending. [http://www.netflixprize.com/assets/grandprize2009_bpc_bellkor.pdf]
Overview of Ensemble Methods

Many instances of the same classifier:
- Bagging (bootstrapping & aggregating): create new data sets using the bootstrap, train a classifier on each, and average over all predictions (e.g. take the majority vote)
- Bagged Trees: use bagging with decision trees
- Random Forest: bagged trees with a special trick (only a random subset of the features is considered at each split)
- Boosting: an iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records

Combining classifiers:
- Weighted averaging over predictions
- Stacking classifiers: use the output of the classifiers as input for a new classifier
Bagging
Idea of the bootstrap. Idea: the sampling distribution (obtained by resampling from the data) lies close enough to the true distribution.
Bagging: Bootstrap Aggregating

Step 1: Create multiple data sets. From the original training data D, draw bootstrap samples D_1, D_2, ..., D_{t-1}, D_t.
Step 2: Build multiple classifiers. Train one classifier C_1, C_2, ..., C_{t-1}, C_t on each bootstrap sample.
Step 3: Combine the classifiers into C*. Aggregating: mean or majority vote.

Source: Tan, Steinbach, Kumar
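A minimal R sketch of these three steps (added for illustration, not from the slides), bagging classification trees on the iris data; the number of bootstrap samples B = 25 is an arbitrary choice.

# Added sketch: bagging of rpart trees by hand (majority vote over B trees)
library(rpart)
set.seed(1)
B <- 25
n <- nrow(iris)
trees <- vector("list", B)
for (b in 1:B) {
  idx        <- sample(n, n, replace = TRUE)            # Step 1: bootstrap sample
  trees[[b]] <- rpart(Species ~ ., data = iris[idx, ])  # Step 2: one tree per sample
}
# Step 3: aggregate by majority vote over the B predictions
preds  <- sapply(trees, function(t) as.character(predict(t, iris, type = "class")))
bagged <- apply(preds, 1, function(row) names(which.max(table(row))))
mean(bagged == iris$Species)                             # training accuracy of the bag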
Why does it work?

Suppose there are 25 base classifiers, each with error rate ε = 0.35. Assume the classifiers are independent (that is the hardest assumption to satisfy) and take a majority vote. The majority vote is wrong if 13 or more of the 25 classifiers are wrong; the number of wrong predictions is X ~ Bin(size = 25, p = 0.35):

> 1 - pbinom(12, size = 25, prob = 0.35)
[1] 0.06044491

$\sum_{i=13}^{25} \binom{25}{i}\,\varepsilon^{i}\,(1-\varepsilon)^{25-i} \approx 0.06$
Why does it work? [Figure: error of the majority vote of 25 base classifiers as a function of the error of a single base classifier.] Ensembles are only better than a single classifier if each classifier is better than random guessing!
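This figure can be reproduced with the binomial formula from the previous slide (an added sketch, not part of the original slides):

# Added sketch: ensemble error (majority vote of 25 independent classifiers)
# as a function of the base classifier error eps
eps      <- seq(0, 1, length.out = 200)
ensemble <- 1 - pbinom(12, size = 25, prob = eps)   # P(13 or more of 25 are wrong)
plot(eps, ensemble, type = "l",
     xlab = "Error of a single base classifier", ylab = "Error of the majority vote")
abline(0, 1, lty = 2)   # reference line: error of a single classifier
# Below eps = 0.5 the ensemble is better than a single classifier; above, it is worse.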